From darin.london at duke.edu Mon Jul 3 08:41:33 2006
From: darin.london at duke.edu (Darin London)
Date: Mon, 03 Jul 2006 08:41:33 -0400
Subject: [Bioperl-l] Call For Birds of a Feather Suggestions
Message-ID: <44A9107D.2050304@duke.edu>
The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOFs have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development. If you would like to create a BOF,
just sign up for a wiki account, log in, and edit the BOSC
2006 Birds of a Feather page.
From bix at sendu.me.uk Wed Jul 5 08:37:34 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 05 Jul 2006 13:37:34 +0100
Subject: [Bioperl-l] checkout_all fails on biodata
Message-ID: <44ABB28E.2000203@sendu.me.uk>
I'm doing:
cvs -d:ext:sendu@dev.open-bio.org:/home/repository/bioperl co bioperl_all
to check out all the bioperl packages at once. However it only checks
out core, db, pedigree, pipeline and run before failing on biodata:
cvs checkout: Updating biodata
cvs checkout: failed to create lock directory for
`/home/repository/bioperl/biodata'
(/home/repository/bioperl/biodata/#cvs.lock): Permission denied
cvs checkout: failed to obtain dir lock in repository
`/home/repository/bioperl/biodata'
cvs [checkout aborted]: read lock failed - giving up
This failure is consistent for me (had it multiple times, different
days, never worked).
Biodata isn't even mentioned as a possible package at
http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to the
end of the alias list so it is checked out last, letting all the other
packages be checked out before failure?
PS. Neither biodata nor pipeline is mentioned as a package on that wiki
page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are there
yet more packages?
Cheers,
Sendu.
From hlapp at gmx.net Wed Jul 5 08:55:42 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 5 Jul 2006 08:55:42 -0400
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <44ABB28E.2000203@sendu.me.uk>
References: <44ABB28E.2000203@sendu.me.uk>
Message-ID: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>
Should have been fixed - I can cvs update. Did you try again?
On Jul 5, 2006, at 8:37 AM, Sendu Bala wrote:
> I'm doing:
>
> cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co
> bioperl_all
>
> to check out all the bioperl packages at once. However it only checks
> out core, db, pedigree, pipeline and run before failing on biodata:
>
> cvs checkout: Updating biodata
> cvs checkout: failed to create lock directory for
> `/home/repository/bioperl/biodata'
> (/home/repository/bioperl/biodata/#cvs.lock): Permission denied
> cvs checkout: failed to obtain dir lock in repository
> `/home/repository/bioperl/biodata'
> cvs [checkout aborted]: read lock failed - giving up
>
> This failure is consistent for me (had it multiple times, different
> days, never worked).
>
> Biodata isn't even mentioned as a possible package at
> http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to
> the
> end of the alias list so it is checked out last, letting all the other
> packages be checked out before failure?
>
> PS. neither biodata nor pipeline are mentioned as a package on that
> wiki
> page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are
> there
> yet more packages?
>
> Cheers,
> Sendu.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From bix at sendu.me.uk Wed Jul 5 09:03:50 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 05 Jul 2006 14:03:50 +0100
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>
References: <44ABB28E.2000203@sendu.me.uk>
<9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>
Message-ID: <44ABB8B6.5040707@sendu.me.uk>
Hilmar Lapp wrote:
> Should have been fixed - I can cvs update. did you try again?
Still doesn't work, no change. I can manually check out the other
packages, I just can't do it with bioperl_all alias.
co bioperl-biodata fails because:
cvs server: cannot find module `bioperl-biodata' - ignored
cvs [checkout aborted]: cannot expand modules
(not that I want it - if it's no longer a bioperl package can it be
removed from the alias?)
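A quick way to see which module names the server actually defines is `cvs co -c`, which prints the modules file instead of checking anything out. The sketch below is only an illustration: `module_defined` is a hypothetical helper, `USER` is a placeholder login, and the sample listing is made up so the block runs without a server. It assumes each module definition starts in the first column of the listing.

```shell
#!/bin/sh
# Hypothetical helper: read a module listing (as printed by `cvs co -c`)
# on stdin and succeed only if the named module is defined in it.
# Assumption: a definition starts in column one; wrapped lines are indented.
module_defined() {
    awk -v name="$1" '/^[^ \t]/ && $1 == name { found = 1 } END { exit !found }'
}

# Made-up listing for illustration; a real one would come from
#   cvs -d :ext:USER@dev.open-bio.org:/home/repository/bioperl co -c
sample_listing='bioperl_all  -d bioperl &core &db &run
biodata      biodata'

for name in biodata bioperl-biodata; do
    if printf '%s\n' "$sample_listing" | module_defined "$name"; then
        echo "$name: defined"
    else
        echo "$name: not defined"
    fi
done
```

With the made-up listing this reports biodata as defined and bioperl-biodata as not defined, which is the shape of the "cannot find module" error above.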
From hlapp at gmx.net Wed Jul 5 09:41:27 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 5 Jul 2006 09:41:27 -0400
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <44ABB8B6.5040707@sendu.me.uk>
References: <44ABB28E.2000203@sendu.me.uk>
<9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>
<44ABB8B6.5040707@sendu.me.uk>
Message-ID:
The idea was once that Bioperl, Biojava, etc had all those unit tests
that use specific sample data which take up quite a bit of space.
Unifying the largely redundant test data into a single shared
repository would save quite a bit of space and therefore download/
update time for people who work on/use more than one Bio* project.
However, this was never fully implemented AFAIK. I.e., you don't need
biodata. I guess it could be removed from the alias since it's not
integrated anyway.
Any other opinions?
I also forwarded your report to root-l as I couldn't find the
offending (stale) lock file.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Wed Jul 5 09:48:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Jul 2006 08:48:03 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <44ABB8B6.5040707@sendu.me.uk>
Message-ID: <000f01c6a039$a7a24f10$15327e82@pyrimidine>
Bioperl-data was a directory started up a few years ago to hold various data
files for testing and as examples (BLAST file examples, GenBank files, etc),
somewhat like the t/data directory but cleaned up a bit more. It hasn't
been updated in a while. Regardless, you should be able to check it out.
As for the problem, looks like Hilmar's checking up on a possible lock file
issue.
Chris
From cjfields at uiuc.edu Wed Jul 5 11:06:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Jul 2006 10:06:30 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To:
Message-ID: <001901c6a044$999a14b0$15327e82@pyrimidine>
I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has:
---------------------------
In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf"
"checkout" "-P" "bioperl_all"
CVSROOT=:ext:cjfields@dev.open-bio.org:/home/repository/bioperl
...
cvs checkout: failed to create lock directory for
`/home/repository/bioperl/biodata'
(/home/repository/bioperl/biodata/#cvs.lock): Permission denied
cvs checkout: failed to obtain dir lock in repository
`/home/repository/bioperl/biodata'
cvs [checkout aborted]: read lock failed - giving up
cvs.exe checkout: in directory bioperl:
cvs.exe checkout: cannot open CVS/Entries for reading: No such file or
directory
---------------------------
I had the same problem with schema (BioSQL) a while back. I tried again,
and...
---------------------------
cvs checkout: failed to create lock directory for
`/home/repository/bioperl/biosql-schema'
(/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied
cvs checkout: failed to obtain dir lock in repository
`/home/repository/bioperl/biosql-schema'
cvs [checkout aborted]: read lock failed - giving up
cvs.exe checkout: in directory .:
cvs.exe checkout: cannot open CVS/Entries for reading: No such file or
directory
---------------------------
I believe it had something to do with CVS commit privileges (i.e. I had none
for schema, which was fine). So maybe this is a permissions issue via the
lock file? Looking at the alias:
bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema
&network &microarray
This may mean that anyone w/o commit privs for any of the above (specifically
schema and biodata) who tries checkout/update using bioperl_all may run
into this problem.
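If the alias aborts on the first module that cannot be locked, one workaround is to check out its members one at a time, so that a failure on biodata or schema does not stop the rest. A minimal sketch, assuming the `&` names in the alias line are themselves valid checkout targets (the thread mentions manual checkouts but does not give the exact names). `USER` is a placeholder login, and `DRY_RUN` is a variable made up for this example.

```shell
#!/bin/sh
# Sketch: check out each member of the bioperl_all alias on its own,
# so one unreadable module does not abort the others.
# Assumptions: USER is a placeholder login; the module names are copied
# from the alias line and are assumed to be valid checkout targets.
CVSROOT_URL=":ext:USER@dev.open-bio.org:/home/repository/bioperl"
MODULES="core db run pipeline pedigree biodata schema network microarray"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

checkout_each() {
    failed=""
    for module in $MODULES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "cvs -d $CVSROOT_URL co $module"
        elif ! cvs -d "$CVSROOT_URL" co "$module"; then
            # Remember the failure and carry on with the next module.
            failed="$failed $module"
        fi
    done
    # Report the modules that could not be checked out, if any.
    [ -z "$failed" ] || echo "failed:$failed" >&2
}

checkout_each
```

By default this only prints the nine commands; running with `DRY_RUN=0` would execute them. Unlike the alias, which uses `-d bioperl`, this leaves each module in the current directory.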
Since it's not integrated I don't see the problem with removing it from the
alias, but if we follow the same line of logic (and privileges are the
issue) then schema must be removed as well. To me it doesn't make much
sense to not include schema though since we can checkout/update bioperl-db.
BTW, I like the idea of biodata as you've outlined it. Would be nice to
gear the test suite towards a more general set of data for all the Bio*
projects versus having each one come with their own, and the data could be
updated a bit more frequently than t/data is. Seems like it would
definitely save a large chunk of real estate for the distributions. If one
wanted to run the full test suite then they would have to download biodata
separately, though, but not a bad compromise. Though, if this is/was its
intent, why would it need a lock file?
Chris
From cjfields at uiuc.edu Wed Jul 5 11:36:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Jul 2006 10:36:33 -0500
Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
In-Reply-To:
Message-ID: <001a01c6a048$cb802420$15327e82@pyrimidine>
Okay, I managed to figure out what the problem was. I committed a fix in
CVS for the initial bug (Selvi's missing hits). Still has one HSP per hit
for now; I think it will take a bit more effort to get a BLAST-like multi
HSP/hit up and running.
Selvi, update from CVS to see if that works.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Friday, June 30, 2006 12:44 PM
> To: Sendu Bala; Jason Stajich
> Cc: bioperl-l at lists.open-bio.org list
> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
>
> I'll try looking at it this weekend. A suggested workaround is to
> either try setting -A for no alignments or setting it to a high
> number to retrieve all of them. It's pretty serious as the error
> silently dumps those domains, so those using automated annotation
> pipelines would miss it unless they are also checking the raw output.
>
> You could design a SearchIO::hmmpfam parser then expand it to take in
> hmmsearch output at a later point, or keep them separate. I like the
> idea of having modules that are more specific about what they parse;
> seems at some point you reach serious code bloat and maintenance
> becomes an issue. Look at SearchIO::blast; it parses various text
> BLAST output very well but with some serious obfuscation. Just don't
> know how productive it would be to separate out the PSI-BLAST and
> bl2seq stuff since they are pretty close to a standard BLAST
> report... oh well.
>
> To Jason : good luck on your move. Drop us a line here to let us
> know everything went well.
>
> Chris
>
> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote:
>
> > Chris Fields wrote:
> >> It may have been just simpler to have it be one HSP (domain) per Hit
> >> (model) as that's how the reports are generated. My reasoning was
> >> that
> >> using the one domain per model made sense based on what you are
> >> actually
> >> trying to do, which is annotate the sequence based on the order the
> >> domain appears. Most others may not view it that way, which is fine.
> >> One can always gather the relevant HSP's, convert to seqfeatures,
> >> then
> >> sort them if order is important, I suppose.
> >>
> >> I would say, if the overall consensus is to modify it to have
> >> multiple
> >> domain hits per model (similar to BLAST) then Sendu should go
> >> ahead and
> >> make those changes then announce it on the list so no one can gripe
> >> about it later. My main concern was not changing things so
> >> dramatically
> >> that it'll break for someone
> >
> > Going on your earlier suggestion, I was thinking about making
> > SearchIO::hmmpfam instead, which would get used if you set the
> > format to
> > 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I
> > suppose I would make a SearchIO::hmmsearch as well, if necessary.
> >
> >
> > [...]
> >> that the reported bug about missing hits (Bug 2036) is fixed as well.
> >
> > However, having never made a SearchIO plugin before, it will be some
> > time before I get my head around it. I'll want to make one the current
> > HOWTO:SearchIO way before I can think about doing it a better way
> > (hashes) as well. So I can say I'll make a move on this at some
> > point in
> > the future, but if someone wants to fix Bug 2036 in the mean time,
> > they
> > are welcome to. Again as suggested, my priority is Bio::Map right now.
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
From arareko at campus.iztacala.unam.mx Wed Jul 5 11:38:14 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Wed, 05 Jul 2006 10:38:14 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <001901c6a044$999a14b0$15327e82@pyrimidine>
References: <001901c6a044$999a14b0$15327e82@pyrimidine>
Message-ID: <44ABDCE6.7090906@campus.iztacala.unam.mx>
Same problem here. I've never used the bioperl_all alias before (I
always check-out dirs individually), but to me it seems like a
privileges issue as Chris suggests.
I also browsed through the whole repository on dev.open-bio.org and didn't
find any such lock file. I guess Chris D. or Jason will know better what's
happening here.
Mauricio.
--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM
From bix at sendu.me.uk Thu Jul 6 04:41:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 06 Jul 2006 09:41:57 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <449A9AF9.2000305@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk>
Message-ID: <44ACCCD5.3030309@sendu.me.uk>
Sendu Bala wrote:
> The next step is to tidy up all of Bio::Map*, which involves a major
> reimplementation of the whole system [...]
> The reimplementation will make Position central to the model, allowing
> for lots of other things to work properly without anything becoming
> inconsistent (as is currently the case).
This is now done. It uses a new PositionHandler class behind the scenes.
The next step is to introduce relative positioning across the board,
possibly in a way that makes OrderedPosition redundant or an implementer
of the system.
Has anyone here ever used Bio::Map* modules for anything? I would
appreciate you sending me your code, especially if you've used MapIO,
Physical (encompassing Clone, Contig, FPCMarker,
OrderedPositionWithDistance) or LinkageMap (encompassing
LinkagePosition, OrderedPosition, Microsatellite) since these have
insufficient tests at the moment.
From nidage at yahoo.com Thu Jul 6 14:13:12 2006
From: nidage at yahoo.com (sss lll)
Date: Thu, 6 Jul 2006 11:13:12 -0700 (PDT)
Subject: [Bioperl-l] PrimarySeqI object Exception
Message-ID: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Hi there,
I encountered a problem while calling module
PrimarySeqI, with the following code:
my $db=Bio::DB::Fasta->new($fafile);
my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
$seqio->write_seq($obj);
The error message was:
MSG: Did not provide a valid Bio::PrimarySeqI object
STACK Bio::SeqIO::fasta::write_seq
/usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178
We think it had to do with the length of the gene name.
For example the following gene name was a problem:
gi|59711891|ref|YP_204667.1| NAD-specific glutamate
dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E
Any ideas on what happened?
Thanks
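One possible cause, offered as an assumption rather than a confirmed diagnosis: by default Bio::DB::Fasta is understood to index each record under the header text up to the first whitespace. A lookup with the full description line would then find nothing, get_Seq_by_id would return undef, and write_seq would reject it. The shell sketch below prints the IDs a FASTA file would be indexed under on that assumption. `fasta_ids` is a hypothetical helper and the sequence line is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the ID of every record in a FASTA file,
# taken as the header text after '>' up to the first whitespace.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Demo with the header from the message and a placeholder sequence line.
tmp=$(mktemp)
printf '%s\n' \
    '>gi|59711891|ref|YP_204667.1| NAD-specific glutamate dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E' \
    'MPLACEHOLDER' > "$tmp"
fasta_ids "$tmp"
rm -f "$tmp"
```

For the header quoted above this prints `gi|59711891|ref|YP_204667.1|`, which on this assumption would be the string to pass to get_Seq_by_id. Checking that the returned object is defined before calling write_seq would show whether the lookup is what fails.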
From rmb32 at cornell.edu Thu Jul 6 19:11:00 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 16:11:00 -0700
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
References: <44A558F2.2050304@cornell.edu>
<2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
Message-ID: <44AD9884.6040507@cornell.edu>
The Annotation/Annotatable stuff was going to be talked about at the
GMOD meeting that just happened, wasn't it? What's the scoop on that?
Rob
Chris Fields wrote:
> If you plan on generating seqfeatures from this output you could check
> out the Bio::Tools core modules for examples. There are a few there
> that take program output and convert them to Bio::SeqFeature::Generic
> objects, including Bio::Tools::RNAMotif and Bio::Tools::tRNAscanSE. If
> alignments are involved you might want something like
> Bio::SeqFeature::FeaturePair. Not sure about using the
> SeqFeature::Annotation or others; I thought that some of the
> Annotation/Annotatable stuff might be changing soon but I may be wrong.
>
> Chris
>
> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote:
>
>> Hi all,
>>
>> I find myself needing a parser for GeneSeqer output, so I'm writing one
>> (which I will submit for your consideration when it's working). In a
>> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of
>> ESTs to genomic sequence, then using those alignments to predict where
>> in the genomic sequence the genes are. So really what you get from this
>> is a bunch of hierarchical features.
>>
>> I don't really know where I should put it in the bioperl hierarchy
>> though. Probably FeatureIO?
>>
>> And what's the current fashion for objects it should emit?
>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated?
>>
>> Rob
>>
>> --Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu
From hlapp at gmx.net Thu Jul 6 19:27:31 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:27:31 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
References: <44A558F2.2050304@cornell.edu>
<2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
<44AD9884.6040507@cornell.edu>
Message-ID: <6B530ED6-5825-47C4-A677-2C75E0F97E26@gmx.net>
No scoop b/c no time. I am busy w/ a grant and Lincoln had to leave
early as well on Friday. Sorry.
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Thu Jul 6 19:28:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:28:09 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
Message-ID: <000001c6a153$d78b83c0$15327e82@pyrimidine>
Not any word yet. Been pretty quiet, likely b/c everybody was there
planning a roadmap for v1.6.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Robert Buels
> Sent: Thursday, July 06, 2006 6:11 PM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] parser for GeneSeqer
>
> The Annotation/Annotatable stuff was going to be talked about at the
> GMOD meeting that just happened, wasn't it? What's the scoop on that?
>
> Rob
>
>
> Chris Fields wrote:
> > If you plan on generating seqfeatures from this output you could check
> > out the Bio::Tools core modules for examples. There are a few there
> > that take program output and convert them to Bio::SeqFeature::Generic
> > objects, including Bio::Tools::RNAMotif and Bio::Tools::tRNAscanSE. If
> > alignments are involved you might want something like
> > Bio::SeqFeature::FeaturePair. Not sure about using the
> > SeqFeature::Annotation or others; I thought that some of the
> > Annotation/Annotatable stuff might be changing soon but I may be wrong.
> >
> > Chris
> >
> > On Jun 30, 2006, at 12:01 PM, Robert Buels wrote:
> >
> >> Hi all,
> >>
> >> I find myself needing a parser for GeneSeqer output, so I'm writing one
> >> (which I will submit for your consideration when it's working). In a
> >> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of
> >> ESTs to genomic sequence, then using those alignments to predict where
> >> in the genomic sequence the genes are. So really what you get from this
> >> is a bunch of hierarchical features.
> >>
> >> I don't really know where I should put it in the bioperl hierarchy
> >> though. Probably FeatureIO?
> >>
> >> And what's the current fashion for objects it should emit?
> >> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated?
> >>
> >> Rob
> >>
> >> --Robert Buels
> >> SGN Bioinformatics Analyst
> >> 252A Emerson Hall, Cornell University
> >> Ithaca, NY 14853
> >> Tel: 503-889-8539
> >> rmb32 at cornell.edu
> >> http://www.sgn.cornell.edu
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign
> >
> >
> >
>
> --
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY 14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
From hlapp at gmx.net Thu Jul 6 19:41:44 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:41:44 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <000001c6a153$d78b83c0$15327e82@pyrimidine>
References: <000001c6a153$d78b83c0$15327e82@pyrimidine>
Message-ID:
Uhm - roadmap - I guess yes, but more that of the Golden State, or
other states on the way, for Jason.
On Jul 6, 2006, at 7:28 PM, Chris Fields wrote:
> Not any word yet. Been pretty quiet, likely b/c everybody was there
> planning a roadmap for v1.6.
>
> Chris
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Thu Jul 6 19:49:23 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:49:23 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To:
Message-ID: <000101c6a156$cee60bc0$15327e82@pyrimidine>
Oh well. There's always BOSC...
Chris
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Thursday, July 06, 2006 6:42 PM
> To: Chris Fields
> Cc: 'Robert Buels'; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] parser for GeneSeqer
>
> Uhm - roadmap - I guess yes, but more that of the Golden State, or
> other states on the way, for Jason.
From osborne1 at optonline.net Thu Jul 6 21:06:32 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 06 Jul 2006 21:06:32 -0400
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID:
sss lll,
This error means that $obj, which is what gets passed to the write_seq
method, is not a valid sequence object. What identifier is in
$array_gene_name[$p]?
Brian O.
On 7/6/06 2:13 PM, "sss lll" wrote:
> Hi there,
>
> I encountered a problem while calling module
> PrimarySeqI, with the following code:
>
> my $db=Bio::DB::Fasta->new($fafile);
> my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
> $seqio->write_seq($obj);
>
> The error message was:
> MSG: Did not provide a valid Bio::PrimarySeqI object
> STACK Bio::SeqIO::fasta::write_seq
> /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178
>
> We think it had to do with the lengh of the gene name.
> For example the following gene name was a problem:
>
> gi|59711891|ref|YP_204667.1| NAD-specific glutamate
> dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E
>
> Any ideas on what happened?
>
> Thanks
>
From rmb32 at cornell.edu Thu Jul 6 21:24:40 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 18:24:40 -0700
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
Message-ID: <44ADB7D8.7080102@cornell.edu>
I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago):
rob at rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi
Copyright 1987-2004, Larry Wall
Perl may be copied only under the terms of either the Artistic License
or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
rob at rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t
1..22
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
Can't locate object method "get_Annotations" via package
"Bio::SeqFeature::Annotated" at
/usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292,
line 2.
ok 7 # Cannot complete FeatureIO tests
ok 8 # Cannot complete FeatureIO tests
ok 9 # Cannot complete FeatureIO tests
ok 10 # Cannot complete FeatureIO tests
ok 11 # Cannot complete FeatureIO tests
ok 12 # Cannot complete FeatureIO tests
ok 13 # Cannot complete FeatureIO tests
ok 14 # Cannot complete FeatureIO tests
ok 15 # Cannot complete FeatureIO tests
ok 16 # Cannot complete FeatureIO tests
ok 17 # Cannot complete FeatureIO tests
ok 18 # Cannot complete FeatureIO tests
ok 19 # Cannot complete FeatureIO tests
ok 20 # Cannot complete FeatureIO tests
ok 21 # Cannot complete FeatureIO tests
ok 22 # Cannot complete FeatureIO tests
However, same code runs fine on my debian unstable machine (perl
5.8.8). Perhaps this is a bug in debian stable's perl?
I did some poking around through the code, changing @ISA = qw/.../ to
use base, switching the order of inclusion in the ISA at the top of
Bio::SeqFeature::Annotated, no dice.
Anybody able to reproduce this? Anyone have any ideas?
Rob
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu
From cjfields at uiuc.edu Thu Jul 6 22:30:25 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 21:30:25 -0500
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
In-Reply-To: <44ADB7D8.7080102@cornell.edu>
Message-ID: <000001c6a16d$4dd7e6e0$15327e82@pyrimidine>
I don't get any issues (all tests pass), except for a few warning messages, which
are normal; some ontology handling is not implemented.
Usually when running tests I use 'perl -I. t/test.t' to force it to use the
core directory first. You might try that to see if it 'fixes' the problem.
If it does, there may be another bioperl installation in @INC being used
instead of your current directory.
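One way to check which installation Perl actually picks up is to look at %INC after loading the module. This is only a sketch: the helper name loaded_from is made up for illustration, and File::Spec is just a stand-in default so that the script runs without BioPerl. Pass Bio::SeqFeature::Annotated (or any other module name) on the command line to check your own setup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the file a module was loaded from,
# or undef if the module cannot be loaded at all.
sub loaded_from {
    my ($module) = @_;
    # Bio::SeqFeature::Annotated -> Bio/SeqFeature/Annotated.pm
    (my $key = "$module.pm") =~ s{::}{/}g;
    return unless eval { require $key; 1 };
    return $INC{$key};
}

# Module names come from the command line; File::Spec is only a default.
for my $module (@ARGV ? @ARGV : ('File::Spec')) {
    my $path = loaded_from($module);
    printf "%-35s %s\n", $module,
        defined $path ? $path : '(not found in @INC)';
}
```

Running it once as plain 'perl script.pl Bio::SeqFeature::Annotated' and once with 'perl -I. ...' from the checkout should show whether the two runs resolve to different files.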
Chris
From chandan.kr.singh at gmail.com Fri Jul 7 01:23:40 2006
From: chandan.kr.singh at gmail.com (CHANDAN SINGH)
Date: Fri, 7 Jul 2006 10:53:40 +0530
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To:
References: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID: <2d4f320607062223y520a1375lb30cf40c1c883702@mail.gmail.com>
Hi
By default, the id is the first word encountered, i.e. the first string after
">", separated from the rest by a space. The sample id you mentioned in your
first mail contains spaces and, as I mentioned in my previous mail, I am sure
the ids made by indexing and the ones you are using in the array are different.
You can see the ids used in indexing by using
my @ids = $db->ids();
print join("\n", @ids);
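To illustrate the point about the default id, here is a minimal sketch in plain Perl of the "first word after >" rule applied to the header from the original mail. The helper default_fasta_id is hypothetical and only mimics the behaviour described above; it is not BioPerl code.

```perl
use strict;
use warnings;

# Illustrative helper (not part of BioPerl): the id is the first
# whitespace-delimited word after the leading ">".
sub default_fasta_id {
    my ($header) = @_;
    return $header =~ /^>(\S+)/ ? $1 : undef;
}

my $header = '>gi|59711891|ref|YP_204667.1| NAD-specific glutamate '
           . 'dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E';
my $id = default_fasta_id($header);
print "indexed id: $id\n";
```

Under that rule, a lookup with the full description line as the key would not match the indexed id, which would leave $obj undefined.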
Cheers
Chandan
From selvik at ufl.edu Fri Jul 7 12:07:03 2006
From: selvik at ufl.edu (Selvi Kadirvel)
Date: Fri, 7 Jul 2006 12:07:03 -0400
Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
In-Reply-To: <001a01c6a048$cb802420$15327e82@pyrimidine>
References: <001a01c6a048$cb802420$15327e82@pyrimidine>
Message-ID: <1A5235F4-87E6-42D7-9796-7FEB8F7C04E5@ufl.edu>
Chris:
I just tried it out, and it looks like this solution works fine for
me. Thank you for the fix!
-Selvi
On Jul 5, 2006, at 11:36 AM, Chris Fields wrote:
> Okay, I managed to figure out what the problem was. I committed a
> fix in
> CVS for the initial bug (Selvi's missing hits). Still has one HSP
> per hit
> for now; I think it will take a bit more effort to get a BLAST-like
> multi
> HSP/hit up and running.
>
> Selvi, update from CVS to see if that works.
>
> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Chris Fields
>> Sent: Friday, June 30, 2006 12:44 PM
>> To: Sendu Bala; Jason Stajich
>> Cc: bioperl-l at lists.open-bio.org list
>> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
>>
>> I'll try looking at it this weekend. A suggested workaround is to
>> either try setting -A for no alignments or setting it to a high
>> number to retrieve all of them. It's pretty serious as the error
>> silently dumps those domains, so for those using automated annotation
>> pipelines would miss it unless they are also checking the raw output.
>>
>> You could design a SearchIO::hmmpfam parser then expand it to take in
>> hmmsearch output at a later point, or keep them separate. I like the
>> idea of having modules that are more specific about what they parse;
>> seems at some point you reach serious code bloat and maintenance
>> becomes an issue. Look at SearchIO::blast; it parses various text
>> BLAST output very well but with some serious obfuscation. Just don't
>> know how productive it would be to separate out the PSI-BLAST and
>> bl2seq stuff since they are pretty close to a standard BLAST
>> report... oh well.
>>
>> To Jason : good luck on your move. Drop us a line here to let us
>> know everything went well.
>>
>> Chris
>>
>> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote:
>>
>>> Chris Fields wrote:
>>>> It may have been just simpler to have it be one HSP (domain) per
>>>> Hit
>>>> (model) as that's how the reports are generated. My reasoning was
>>>> that
>>>> using the one domain per model made sense based on what you are
>>>> actually
>>>> trying to do, which is annotate the sequence based on the order the
>>>> domain appears. Most others may not view it that way, which is
>>>> fine.
>>>> One can always gather the relevant HSP's, convert to seqfeatures,
>>>> then
>>>> sort them if order is important, I suppose.
>>>>
>>>> I would say, if the overall consensus is to modify it to have
>>>> multiple
>>>> domain hits per model (similar to BLAST) then Sendu should go
>>>> ahead and
>>>> make those changes then announce it on the list so no one can gripe
>>>> about it later. My main concern was not changing things so
>>>> dramatically
>>>> that it'll break for someone
>>>
>>> Going on your earlier suggestion, I was thinking about making
>>> SearchIO::hmmpfam instead, which would get used if you set the
>>> format to
>>> 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I
>>> suppose I would make a SearchIO::hmmsearch as well, if necessary.
>>>
>>>
>>> [...]
>>>> that the reported bug about missing hits (Bug 2036) is fixed as
>>>> well.
>>>
>>> However, having never made a SearchIO plugin before, it will be some
>>> time before I get my head around it. I'll want to make one the
>>> current
>>> HOWTO:SearchIO way before I can think about doing it a better way
>>> (hashes) as well. So I can say I'll make a move on this at some
>>> point in
>>> the future, but if someone wants to fix Bug 2036 in the mean time,
>>> they
>>> are welcome to. Again as suggested, my priority is Bio::Map right
>>> now.
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
From cjfields at uiuc.edu Fri Jul 7 12:16:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 7 Jul 2006 11:16:30 -0500
Subject: [Bioperl-l] Bio::SeqFeatureI spliced_seq
Message-ID: <002a01c6a1e0$b4e2b360$15327e82@pyrimidine>
There is a reported bug (Bug 2039) which I found an easy fix for; the issue
is that spliced_seq, as currently implemented, has two optional arguments:
my ($self, $db, $nosort) = @_;
$db is-a Bio::DB::RandomAccessI; $nosort is a flag so that locations aren't
sorted before splicing, which is the crux of the bug.
So, to set $nosort you must also set $db to either undef or a
Bio::DB::RandomAccessI (a point not made in the docs and not immediately
clear to the user). Would it make more sense to have something like this
(using $self->_rearrange to get the options)?
my $seq = $sf->spliced_seq(-nosort => 1);
my $seq = $sf->spliced_seq(-db => $db);
my $seq = $sf->spliced_seq(-nosort => 1,
                           -db     => $db);
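A rough sketch of how both calling styles could coexist, so that existing positional callers keep working. The function parse_spliced_seq_args is a hypothetical stand-in written in plain Perl; it is not the real $self->_rearrange and not the actual spliced_seq code.

```perl
use strict;
use warnings;

# Hypothetical argument parser: accepts the old positional form
# ($db, $nosort) or the proposed named form (-db => $db, -nosort => 1).
sub parse_spliced_seq_args {
    my @args = @_;
    my ($db, $nosort);
    if (@args && defined $args[0] && !ref $args[0] && $args[0] =~ /^-/) {
        # Named style: strip the leading dash and ignore case.
        my %opt;
        while (my ($key, $value) = splice(@args, 0, 2)) {
            (my $name = lc $key) =~ s/^-//;
            $opt{$name} = $value;
        }
        ($db, $nosort) = @opt{qw(db nosort)};
    }
    else {
        # Old positional style.
        ($db, $nosort) = @args;
    }
    return ($db, $nosort ? 1 : 0);
}

my ($db, $nosort) = parse_spliced_seq_args(-nosort => 1);
print "db: ", (defined $db ? $db : 'undef'), ", nosort: $nosort\n";
```

With this approach the user no longer has to pass undef for $db just to reach $nosort.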
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
From vebaev at gmail.com Sat Jul 8 16:59:40 2006
From: vebaev at gmail.com (Vesselin Baev)
Date: Sat, 08 Jul 2006 23:59:40 +0300
Subject: [Bioperl-l] BLAST running options
Message-ID: <44B01CBC.9070404@gmail.com>
Hi,
I'm parsing Blast results, but I need a Blast option to limit and
decrease the number of Blast results.
I'm blasting an oligo of about 40nt and I need only results which have
mismatches (not more than 10) or match exactly, but over the full length
of the query - 40.
I do not want the large amount of results that blast gives me for
shorter matches.
Does anyone know what kind of BLAST option to use?
Thanks
--
------------------------------------------------
University of Plovdiv
Faculty of Biology
Dept. Molecular Biology and Plant Physiology
Tzar Asen 24
Plovdiv 4000, BULGARIA
vebaev at gmail.com
tel.00359889034044
From cjfields at uiuc.edu Sat Jul 8 19:15:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 8 Jul 2006 18:15:29 -0500
Subject: [Bioperl-l] BLAST running options
In-Reply-To: <44B01CBC.9070404@gmail.com>
References: <44B01CBC.9070404@gmail.com>
Message-ID: <95D47990-9B63-444D-B386-04219D21DC39@uiuc.edu>
There were some posts about this a few months back.
http://bioperl.org/pipermail/bioperl-l/2006-April/021341.html
Essentially, most responders suggested not using BLAST, but I believe
there were a few who gave pointers.
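If no BLAST option does the job, post-filtering the hits is one alternative. The sketch below assumes tabular BLAST output (the 12-column format, e.g. -m 8 with blastall), where column 5 is the mismatch count and columns 7 and 8 are the query start and end. The helper keep_hit and the three sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical filter: keep a tabular hit only if it spans the whole
# query and has no more than $max_mismatch mismatches.
sub keep_hit {
    my ($line, $query_len, $max_mismatch) = @_;
    chomp $line;
    return 0 if $line =~ /^#/ || $line !~ /\S/;   # skip comments and blanks
    my @f = split /\t/, $line;
    return 0 unless @f >= 12;                     # not a 12-column hit line
    my ($mismatch, $qstart, $qend) = @f[4, 6, 7];
    my $covered = abs($qend - $qstart) + 1;
    return ($covered == $query_len && $mismatch <= $max_mismatch) ? 1 : 0;
}

# Invented sample lines, for demonstration only.
my @sample = (
    join("\t", qw(oligo1 chr1 100.00 40 0 0 1 40 1000 1039 1e-15 79.8)),
    join("\t", qw(oligo1 chr2 95.00 40 2 0 1 40 500 539 1e-10 63.9)),
    join("\t", qw(oligo1 chr3 100.00 18 0 0 5 22 70 87 0.5 36.2)),
);
print "$_\n" for grep { keep_hit($_, 40, 10) } @sample;
```

In real use the lines would be read from the BLAST report instead of @sample; gaps are not considered here.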
Chris
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Mon Jul 10 17:09:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 10 Jul 2006 16:09:12 -0500
Subject: [Bioperl-l] How to use gi2taxonid
Message-ID: <000301c6a465$182025d0$15327e82@pyrimidine>
Hubert,
In case you didn't get this going, there may be another option now. I have
started work on a new set of modules called Bio::DB::EUtilities in
bioperl-live, intended as a back-end for NCBI database searches. It can be
used directly if needed though. You can use EPost/Elink to directly
retrieve the taxonIDs using the following script (pass a file containing the
protein/nucleotide primary ID on command line). The below retrieves
taxonid's using protein GI's:
use Bio::DB::EUtilities;

my @ids;
while (my $id = <>) {
    chomp $id;
    push @ids, $id;
}

my $epost = Bio::DB::EUtilities->new(
    -eutil => 'epost',
    -db    => 'protein',
    -id    => \@ids,
);
$epost->get_response;

my $elink = Bio::DB::EUtilities->new(
    -eutil  => 'elink',
    -cookie => $epost->next_cookie,
    -db     => 'taxonomy',
);
$elink->get_response;

my @tax_ids = $elink->get_db_ids;
Chris
> hi,
> I have downloaded the gi2taxonid file to get the taxonid for a GI
> number
> taken from a report as recommended here, but I don't know how to
> use the
> gi2taxonid file.
> Jason wrote in a previous post that you have to make a DB_File out of
> it, but I don't know how....and finally tie it to a hash....
> Can anybody give me a hint how to use it..... my final goal is to get
> the taxonomy.
>
> thanks
> Hubert
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
From hubert.prielinger at gmx.at Mon Jul 10 19:53:26 2006
From: hubert.prielinger at gmx.at (Hubert Prielinger)
Date: Mon, 10 Jul 2006 17:53:26 -0600
Subject: [Bioperl-l] How to use gi2taxonid
In-Reply-To: <000301c6a465$182025d0$15327e82@pyrimidine>
References: <000301c6a465$182025d0$15327e82@pyrimidine>
Message-ID: <44B2E876.2020200@gmx.at>
Hi Chris,
thanks for your response. Actually, I have already done it with the EUtils,
because I only have accession ids and there is no way to retrieve the
taxonomy directly for an accession id. Because the xml files you
retrieve are very small, I first pass the accession id to esearch and parse
the Uid from the xml file, then pass the Uid to esummary and parse the tax id
from the xml, and finally pass the tax id to esummary again, retrieve the
taxonomy and parse it. I know it is a little bit intricate, but it works
fine. Thanks.
regards
Hubert
From MEC at stowers-institute.org Mon Jul 10 20:25:11 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Mon, 10 Jul 2006 19:25:11 -0500
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
Message-ID:
I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the
feature coordinates on - strand predictions.
In particular, start & end are deliberately reversed if the strand is
'-'.
I guess this was a holdover from Genscan.pm and wasn't really tested
!?!?!
Or, perhaps fgenesh v 2.4 which I am running has different output in
this respect compared to the version 2.0, against which this module was
written.
Or, perhaps my understanding is blotto (known to happen).
Does anyone know for sure?
If I comment out selected lines...
# if($predobj->strand() == 1) {
$predobj->start($start);
$predobj->end($end);
# } else {
# $predobj->end($start);
# $predobj->start($end);
# }
... then GFF produced by my naive fgenesh2gff script below is correct
(at least w.r.t. strand and coordinates - GFF compatibility purists
might wince).
Should I commit this change to head?
Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research
#!/usr/bin/env perl
# fgenesh2gff
# PURPOSE: parse fgenesh output into gff
# USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff
use strict;
use warnings;
use Bio::Tools::Fgenesh;
use Bio::Tools::GFF; # the writer instantiated below
# Remaining options should name files to process, but if none, process
# standard input:
@ARGV = ('-') unless @ARGV;
my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);
my $featureout = new Bio::Tools::GFF(
-gff_version => 2, #whatever ;)
);
my $IDNUM = 0;
while (my $gene = $fgenesh->next_prediction()) {
my $ID = "fgenesh" . ++ $IDNUM;
$gene->add_tag_value('ID', $ID);
$featureout->write_feature($gene);
foreach ($gene->exons()) {
$_->add_tag_value('Parent', $ID);
$_->seq_id($gene->seq_id);
$featureout->write_feature($_);
}
}
$fgenesh->close();
exit 0;
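For reference, a minimal sketch of the coordinate convention in question: BioPerl range objects keep start <= end and record the direction separately in strand. The helper name normalize_coords is hypothetical and is not part of Bio::Tools::Fgenesh; it only illustrates the rule and is not the change proposed above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Tools::Fgenesh): return coordinates
# with start <= end, leaving the strand to say which way the feature reads.
sub normalize_coords {
    my ($start, $end, $strand) = @_;
    ($start, $end) = ($end, $start) if $start > $end;
    return ($start, $end, $strand);
}

# A minus-strand exon reported high-to-low and one reported low-to-high
# both end up as the same low-to-high range.
for my $exon ([500, 100, -1], [100, 500, -1], [20, 80, 1]) {
    my ($s, $e, $str) = normalize_coords(@$exon);
    printf "%d\t%d\t%s\n", $s, $e, $str < 0 ? '-' : '+';
}
```

With the convention applied, the strand column alone tells a GFF consumer the orientation, and start/end never need to be swapped back.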
From chris at dwan.org Mon Jul 10 22:06:41 2006
From: chris at dwan.org (Christopher Dwan)
Date: Mon, 10 Jul 2006 22:06:41 -0400
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
In-Reply-To:
References:
Message-ID:
I'm not surprised that there are parts that don't work right. I copied
genscan.pm and made the absolute minimal changes required to get what
I needed working. Haven't touched it since.
Please feel free to do what needs to be done, and sorry about the mess.
-Chris Dwan
On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote:
> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the
> feature coordinates on - strand predictions.
>
> In particular, start & end are deliberately reversed if the strand is
> '-'.
>
> I guess this was a holdover from Genscan.pm and wasn't really tested
> !?!?!
>
> Or, perhaps fgenesh v 2.4 which I am running has different output in
> this respect compared to the version 2.0, against which this module
> was
> written.
>
> Or, perhaps my understanding is blotto (known to happen).
>
> Does anyone know for sure?
>
> If I comment out selected lines...
>
> # if($predobj->strand() == 1) {
> $predobj->start($start);
> $predobj->end($end);
> # } else {
> # $predobj->end($start);
> # $predobj->start($end);
> # }
>
> ... then GFF produced by my naive fgenesh2gff script below is correct
> (at least w.r.t. strand and coordinates - GFF compatibility purists
> might wince).
>
> Should I commit this change to head?
>
>
> Malcolm Cook
> Database Applications Manager, Bioinformatics
> Stowers Institute for Medical Research
>
>
>
> #!/usr/bin/env perl
>
> # fgenesh2gff
> # PURPOSE: parse fgenesh output into gff
> # USAGE: fgenesh fish somefish.dna | fgenesh2gff >
> somefish.dna.fgenesh.gff
>
> use strict;
> use warnings;
> use Bio::Tools::Fgenesh;
> use Bio::FeatureIO;
>
> # Remaining options should name files to process, but if none, process
> # standard input:
> @ARGV = ('-') unless @ARGV;
> my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);
>
> my $featureout = new Bio::Tools::GFF(
> -gff_version => 2, #whatever ;)
> );
> my $IDNUM = 0;
> while (my $gene = $fgenesh->next_prediction()) {
> my $ID = "fgenesh" . ++ $IDNUM;
> $gene->add_tag_value('ID', $ID);
> $featureout->write_feature($gene);
> foreach ($gene->exons()) {
> $_->add_tag_value('Parent', $ID);
> $_->seq_id($gene->seq_id);
> $featureout->write_feature($_);
> }
> }
> $fgenesh->close();
>
> exit 0;
>
From rvosa at sfu.ca Tue Jul 11 04:58:46 2006
From: rvosa at sfu.ca (Rutger Vos)
Date: Tue, 11 Jul 2006 01:58:46 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
Message-ID: <44B36846.8070103@sfu.ca>
Dear all,
would it be possible to overload Bio::Root::RootI's 'throw' method to
accept an additional, optional (positional) argument to define the
exception class, e.g. using Exception::Class:
# ...somewhere ...
sub makefh {
my ( $self, $filename ) = @_;
open my $fh, '<', $filename or $self->throw("Can't open file: $!",
'Bio::Exceptions::FileIO'); # NOTE second argument
return $fh;
}
#.... somewhere else
my $fh;
eval {
$fh = $obj->makefh( 'data.txt');
};
if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
# something's wrong with the file?
}
--
++++++++++++++++++++++++++++++++++++++++++++++++++++
Rutger Vos, PhD. candidate
Department of Biological Sciences
Simon Fraser University
8888 University Drive
Burnaby, BC, V5A1S6
Phone: 604-291-5625
Fax: 604-291-3496
Personal site: http://www.sfu.ca/~rvosa
FAB* lab: http://www.sfu.ca/~fabstar
Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
++++++++++++++++++++++++++++++++++++++++++++++++++++
From khoiwal_tara at yahoo.co.in Tue Jul 11 08:19:17 2006
From: khoiwal_tara at yahoo.co.in (Khoiwal Tara)
Date: Tue, 11 Jul 2006 05:19:17 -0700 (PDT)
Subject: [Bioperl-l] Need help in needle parser
Message-ID: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
Hi,
I want to parse the output of needle. I tried, but was not able to get the expected output.
My code is as follows:
#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::AlignIO;
my $needleReport = $ARGV[0];
my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);
while(my $align = $in->next_aln()){
print "Alignment Length:".$align->length()."\n";
print "Percentage Identity:".$align->percentage_identity()."\n";
print "Consensus string:".$align->consensus_string()."\n";
print "Number of sequences:".$align->no_sequence()."\n";
print "Number of residues:".$align->no_residues()."\n";
}
But it never enters the while loop. Please help me.
How can I find the alignment position of the query sequence on the target sequence from the needle output?
Where can I find good documentation on the needle parser and its usage, and a good introduction to BioPerl for beginners?
Regards,
Tara Khoiwal.
From cjfields at uiuc.edu Tue Jul 11 08:59:07 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 07:59:07 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
References: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
Message-ID: <250EEE60-48D0-4844-B0C0-13E17E60965C@uiuc.edu>
perldoc Bio::AlignIO
perldoc Bio::AlignIO::needle
http://www.bioperl.org/wiki/FAQ
http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/Bptutorial.pl
http://www.catb.org/~esr/faqs/smart-questions.html
Google is your friend!
If it isn't entering the while loop, there are two possibilities:
1) Something is wrong with the file
2) The parser isn't reading the file correctly
In order to know which, we will need to see the alignment itself.
Chris
On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote:
> Hi,
> I want to parse the output of needle.I tried but didn't able to
> get expected output.
>
> My code is as follows:
>
> #!/usr/local/bin/perl
>
> use strict;
> use warnings;
> use Bio::AlignIO;
> my $needleReport = $ARGV[0];
>
> my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);
>
> while(my $align = $in->next_aln()){
> print "Alignment Length:".$align->length()."\n";
> print "Percentage Identity:".$align->percentage_identity()."\n";
> print "Consensus string:".$align->consensus_string()."\n";
> print "Number of sequences:".$align->no_sequence()."\n";
> print "Number of residues:".$align->no_residues()."\n";
> }
>
> But it doesn't go inside the while loop.
> Pls help me.
> How to find the alignment position for the query sequence on the
> target sequence from the needle output?
> Where can i find the good documentation on needle parser and its
> usage?
> Good document on bioperl for beginners.
>
> Regards,
> Tara Khoiwal.
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Tue Jul 11 09:13:23 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 08:13:23 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <44B36846.8070103@sfu.ca>
References: <44B36846.8070103@sfu.ca>
Message-ID: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
I suppose you could; Bio::Root::Root does that using Error.pm (if it
is installed). It almost sounds like what Bio::Root::Root does is
what you want, but you want a little more information when exceptions
are thrown maybe?
from perldoc Bio::Root::Root:
...
# Alternatively, using the new typed exception syntax in
the throw() call:
$obj->throw( -class => 'Bio::Root::BadParameter',
-text => "Can not open file $file",
-value => $file);
...
Typed Exception Syntax
The typed exception syntax of throw() has the advantage of
plainly
indicating the nature of the trouble, since the name of the
class is
included in the title of the exception output.
To take advantage of this capability, you must specify
arguments as
named parameters in the throw() call. Here are the parameters:
-class
name of the class of the exception. This should be one
of the
classes defined in Bio::Root::Exception, or a custom
error of yours
that extends one of the exceptions defined in
Bio::Root::Exception.
-text
a sensible message for the exception
-value
the value causing the exception or $!, if appropriate.
Note that Bio::Root::Exception does not need to be imported
into your
module (or script) namespace in order to throw exceptions via
Bio::Root::Root::throw(), since Bio::Root::Root imports it.
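To make the pattern concrete, here is a minimal sketch in plain Perl of throwing a typed exception with named parameters and catching it by class. The My::Exception classes and the standalone throw() function are hypothetical stand-ins so that the example runs without BioPerl; they are not the Bio::Root::Exception classes themselves.

```perl
use strict;
use warnings;

# Hypothetical exception classes, standing in for the ones in
# Bio::Root::Exception.
package My::Exception;
sub new {
    my ($class, %args) = @_;
    return bless { text => $args{-text}, value => $args{-value} }, $class;
}
sub text  { $_[0]{text} }
sub value { $_[0]{value} }

package My::Exception::FileIO;
our @ISA = ('My::Exception');

package main;

# Throw an exception of the class named by -class (default: the base class).
sub throw {
    my (%args) = @_;
    my $class = delete $args{-class} || 'My::Exception';
    die $class->new(%args);
}

sub makefh {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or throw(-class => 'My::Exception::FileIO',
                 -text  => "Can't open file: $!",
                 -value => $filename);
    return $fh;
}

# The block form of eval needs its closing semicolon.
my $fh = eval { makefh('no/such/file.txt') };
if (my $err = $@) {
    if (ref($err) && $err->isa('My::Exception::FileIO')) {
        print "File problem with ", $err->value, "\n";
    }
    else {
        die $err;    # not ours: rethrow
    }
}
```

Checking ref($err) before calling isa() keeps the handler safe when the exception is a plain string rather than an object.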
Chris
On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote:
> Dear all,
>
> would it be possible to overload Bio::Root::RootI's 'throw' method to
> accept an additional, optional (positional) argument to define the
> exception class, e.g. using Exception::Class:
>
> # ...somewhere ...
>
> sub makefh {
> my ( $self, $filename ) = @_;
> open my $fh, '<' $filename or $self->throw("Can't open file: $!",
> 'Bio::Exceptions::FileIO'); # NOTE second argument
> return $fh;
> }
>
> #.... somewhere else
> my $fh;
> eval {
> $fh = $obj->makefh( 'data.txt');
> }
> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
> # something's wrong with the file?
> }
>
> --
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
> Rutger Vos, PhD. candidate
> Department of Biological Sciences
> Simon Fraser University
> 8888 University Drive
> Burnaby, BC, V5A1S6
> Phone: 604-291-5625
> Fax: 604-291-3496
> Personal site: http://www.sfu.ca/~rvosa
> FAB* lab: http://www.sfu.ca/~fabstar
> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Tue Jul 11 11:25:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 10:25:32 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com>
Message-ID: <001601c6a4fe$3ff7ca10$15327e82@pyrimidine>
There are a few odd things about the data you sent: the FASTA files aren't
in FASTA format (they are raw sequence), and the needle output doesn't have
sequence names. You could try running these through needle with descriptors
to see if that helps, but it is very likely my option #2 (i.e. the parser
doesn't recognize the format). There is a thread on the mailing list about
this issue:
http://thread.gmane.org/gmane.comp.lang.perl.bio.general/8926/focus=8935
Basically, it looks like the needle output has changed dramatically in
EMBOSS v3. Jason's suggested options from the above thread (as well as
mine):
> I think the "emboss" format changed in 3.0.0
> solutions:
> a) fix the AlignIO::emboss parser to handle both flavors (old and new)
> b) have it output MSF format and use AlignIO::msf.
So, as a workaround, use MSF output.
I won't have time to look at this anytime soon as I'm busy at $job and
getting ready for a summer institute; I'll submit this as a bug to see if
someone else can tackle it before I get back in early August.
Chris
_____
From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in]
Sent: Tuesday, July 11, 2006 8:26 AM
To: Chris Fields
Subject: Re: [Bioperl-l] Need help in needle parser
I am sending my testing data to you. I have two fasta files
"GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as
follows:
$ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle
So the output of needle is stored in outfile.needle. I am
attaching the output file as well. Please check it and tell me if it has any
problem.
Is my output file correct?
Thanks and Regards,
Tara.
Chris Fields wrote:
perldoc Bio::AlignIO
perldoc Bio::AlignIO::needle
http://www.bioperl.org/wiki/FAQ
http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/Bptutorial.pl
http://www.catb.org/~esr/faqs/smart-questions.html
Google is your friend!
If it isn't entering the while loop, there are two possibilities:
1) Something is wrong with the file
2) The parser isn't reading the file correctly
In order to know which, we will need to see the alignment itself.
Chris
On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote:
> Hi,
> I want to parse the output of needle.I tried but didn't able to
> get expected output.
>
> My code is as follows:
>
> #!/usr/local/bin/perl
>
> use strict;
> use warnings;
> use Bio::AlignIO;
> my $needleReport = $ARGV[0];
>
> my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);
>
> while(my $align = $in->next_aln()){
> print "Alignment Length:".$align->length()."\n";
> print "Percentage Identity:".$align->percentage_identity()."\n";
> print "Consensus string:".$align->consensus_string()."\n";
> print "Number of sequences:".$align->no_sequence()."\n";
> print "Number of residues:".$align->no_residues()."\n";
> }
>
> But it doesn't go inside the while loop.
> Pls help me.
> How to find the alignment position for the query sequence on the
> target sequence from the needle output?
> Where can i find the good documentation on needle parser and its
> usage?
> Good document on bioperl for beginners.
>
> Regards,
> Tara Khoiwal.
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From MEC at stowers-institute.org Tue Jul 11 11:56:40 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Tue, 11 Jul 2006 10:56:40 -0500
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
Message-ID:
Got it.
Commits made.
Thanks for the history lesson.
Cheers,
Malcolm Cook
>-----Original Message-----
>From: Christopher Dwan [mailto:chris at dwan.org]
>Sent: Monday, July 10, 2006 9:07 PM
>To: Cook, Malcolm
>Cc: bioperl-l
>Subject: Re: Bio::Tools::Fgenesh bug? and fix?
>
>
>I'm not surprised that there are parts that don't work right, I coped
>genscan.pm and made the absolute minimal changes required to get what
>I needed working. Haven't touched it since.
>
>Please feel free to do what needs to be done, and sorry about the mess.
>
>-Chris Dwan
>
>On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote:
>
>> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the
>> feature coordinates on - strand predictions.
>>
>> In particular, start & end are deliberately reversed if the strand is
>> '-'.
>>
>> I guess this was a holdover from Genscan.pm and wasn't really tested
>> !?!?!
>>
>> Or, perhaps fgenesh v 2.4 which I am running has different output in
>> this respect compared to the version 2.0, against which this module
>> was
>> written.
>>
>> Or, perhaps my understanding is blotto (known to happen).
>>
>> Does anyone know for sure?
>>
>> If I comment out selected lines...
>>
>> # if($predobj->strand() == 1) {
>> $predobj->start($start);
>> $predobj->end($end);
>> # } else {
>> # $predobj->end($start);
>> # $predobj->start($end);
>> # }
>>
>> ... then GFF produced by my naive fgenesh2gff script below is correct
>> (at least w.r.t. strand and coordinates - GFF compatibility purists
>> might wince).
>>
>> Should I commit this change to head?
>>
>>
>> Malcolm Cook
>> Database Applications Manager, Bioinformatics
>> Stowers Institute for Medical Research
>>
>>
>>
>> #!/usr/bin/env perl
>>
>> # fgenesh2gff
>> # PURPOSE: parse fgenesh output into gff
>> # USAGE: fgenesh fish somefish.dna | fgenesh2gff >
>> somefish.dna.fgenesh.gff
>>
>> use strict;
>> use warnings;
>> use Bio::Tools::Fgenesh;
>> use Bio::FeatureIO;
>>
>> # Remaining options should name files to process, but if
>none, process
>> # standard input:
>> @ARGV = ('-') unless @ARGV;
>> my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);
>>
>> my $featureout = new Bio::Tools::GFF(
>> -gff_version => 2, #whatever ;)
>> );
>> my $IDNUM = 0;
>> while (my $gene = $fgenesh->next_prediction()) {
>> my $ID = "fgenesh" . ++ $IDNUM;
>> $gene->add_tag_value('ID', $ID);
>> $featureout->write_feature($gene);
>> foreach ($gene->exons()) {
>> $_->add_tag_value('Parent', $ID);
>> $_->seq_id($gene->seq_id);
>> $featureout->write_feature($_);
>> }
>> }
>> $fgenesh->close();
>>
>> exit 0;
>>
>
>
From cjfields at uiuc.edu Tue Jul 11 12:04:49 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 11:04:49 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com>
Message-ID: <000101c6a503$bd982eb0$15327e82@pyrimidine>
Okay, I take that back. Bio::AlignIO::emboss does parse EMBOSS v3 needle
output! The fact that it doesn't parse your alignment is b/c there are no
sequence descriptors in the file for the sequences (your FASTA files didn't
have them either). Modifying the file to contain descriptions for both the
alignment and the 'Aligned_sequences:' section gets your test alignment to
work. I consider this a feature and not a bug; how would others be able to
distinguish between numerous sequences in an alignment w/o identifiers of
some sort? It shouldn't just toss this out without a warning however; I'll
try to add a little exception handling.
BTW, one line is incorrect in your script; it should be
print "Number of sequences:".$align->no_sequences()."\n";
you have
print "Number of sequences:".$align->no_sequence()."\n";
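In the meantime, a quick pre-check along these lines may help spot the problem before handing a report to the parser. This is only a sketch: the sub name unnamed_sequences is made up, and the "# N: name" header layout is an assumption based on typical needle output, not something taken from the Bio::AlignIO::emboss code.

```perl
use strict;
use warnings;

# Return the numbers of the "# N: name" header entries that have no name.
# The header layout assumed here is that of typical needle output.
sub unnamed_sequences {
    my ($report) = @_;
    my @unnamed;
    for my $line (split /\n/, $report) {
        next unless $line =~ /^#\s*(\d+):\s*(\S*)/;
        push @unnamed, $1 if $2 eq '';
    }
    return @unnamed;
}

# Made-up header fragment in which the second sequence has no name.
my $report = <<'END';
# Aligned_sequences: 2
# 1: GenomicSeq
# 2:
# Matrix: EDNAFULL
END

my @bad = unnamed_sequences($report);
my $message = @bad ? "Unnamed sequence(s): @bad" : "All sequences are named";
print "$message\n";
```

If the check reports unnamed sequences, adding FASTA description lines to the input files and rerunning needle should give the parser the identifiers it needs.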
Chris
_____
From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in]
Sent: Tuesday, July 11, 2006 8:26 AM
To: Chris Fields
Subject: Re: [Bioperl-l] Need help in needle parser
I am sending my testing data to you. I have two fasta files
"GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as
follows:
$ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle
So the output of needle is stored in outfile.needle. I am
attaching the output file as well. Please check it and tell me if it has any
problem.
Is my output file correct?
Thanks and Regards,
Tara.
Chris Fields wrote:
perldoc Bio::AlignIO
perldoc Bio::AlignIO::needle
http://www.bioperl.org/wiki/FAQ
http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/Bptutorial.pl
http://www.catb.org/~esr/faqs/smart-questions.html
Google is your friend!
If it isn't entering the while loop, there are two possibilities:
1) Something is wrong with the file
2) The parser isn't reading the file correctly
In order to know which, we will need to see the alignment itself.
Chris
On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote:
> Hi,
> I want to parse the output of needle.I tried but didn't able to
> get expected output.
>
> My code is as follows:
>
> #!/usr/local/bin/perl
>
> use strict;
> use warnings;
> use Bio::AlignIO;
> my $needleReport = $ARGV[0];
>
> my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);
>
> while(my $align = $in->next_aln()){
> print "Alignment Length:".$align->length()."\n";
> print "Percentage Identity:".$align->percentage_identity()."\n";
> print "Consensus string:".$align->consensus_string()."\n";
> print "Number of sequences:".$align->no_sequence()."\n";
> print "Number of residues:".$align->no_residues()."\n";
> }
>
> But it doesn't go inside the while loop.
> Pls help me.
> How to find the alignment position for the query sequence on the
> target sequence from the needle output?
> Where can i find the good documentation on needle parser and its
> usage?
> Good document on bioperl for beginners.
>
> Regards,
> Tara Khoiwal.
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From wrp at virginia.edu Tue Jul 11 14:05:29 2006
From: wrp at virginia.edu (William R. Pearson)
Date: Tue, 11 Jul 2006 14:05:29 -0400
Subject: [Bioperl-l] Course announcement: CSHL Computational Genomics Course
In-Reply-To:
References:
Message-ID: <45D80228-35DE-44B0-9E11-48EC76CE0DE7@virginia.edu>
Course announcement - Application deadline, July 15, 2006
================================================================
Cold Spring Harbor
COMPUTATIONAL & COMPARATIVE GENOMICS
November 8 - 14, 2006
Application Deadline: July 15, 2006
INSTRUCTORS:
Pearson, William, Ph.D., University of Virginia, Charlottesville, VA
Smith, Randall, Ph.D., SmithKline Beecham Pharmaceuticals, King of
Prussia, PA
Beyond BLAST and FASTA - Alignment: from proteins to genomes - This
course presents a comprehensive overview of the theory and practice of
computational methods for extracting the maximum amount of information
from protein and DNA sequence similarity through sequence database
searches, statistical analysis, multiple sequence alignment, and
genome scale alignment. Additional topics include gene finding,
identifying signals in unaligned sequences, and integration of genetic
and sequence information in biological databases.
The course combines lectures with hands-on exercises; students are
encouraged to pose challenging sequence analysis problems using their
own data. The course makes extensive use of local WWW pages to present
problem sets and the computing tools to solve them. Students use
Windows and Mac workstations attached to a UNIX server; participants
should be comfortable using the Unix operating system and a Unix text
editor.
The course is designed for biologists seeking advanced training in
biological sequence analysis, computational biology core resource
directors and staff, and for scientists in other disciplines, such as
computer science, who wish to survey current research problems in
biological sequence analysis and comparative genomics.
The primary focus of the Computational and Comparative Genomics Course
is the theory and practice of algorithms used in computational
biology, with the goal of using current methods more effectively and
developing new algorithms. Cold Spring Harbor also offers a
"Programming for Biology" course, which focuses more on software
development.
Over the past few years, the course has been expanded to cover more
algorithms and exercises on comparative genomics and genome databases.
For additional information and the lecture schedule and problem sets
for the 2005 course, see:
http://fasta.bioch.virginia.edu/cshl05
================================================================
To apply to the course, fill out the form at:
http://meetings.cshl.edu/courses/courseapplication.asp
================================================================
From rvosa at sfu.ca Tue Jul 11 14:58:25 2006
From: rvosa at sfu.ca (Rutger Vos)
Date: Tue, 11 Jul 2006 11:58:25 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
References: <44B36846.8070103@sfu.ca>
<954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
Message-ID: <44B3F4D1.7090804@sfu.ca>
I must have overlooked this. I think it does what I want. So could I do
something like:
$obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );
...in interfaces?
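To show what I mean, here is a rough sketch in plain Perl of an interface whose unimplemented methods throw a typed exception. All the My:: names are hypothetical, and whether the real throw_not_implemented() accepts a -class argument is exactly the open question, so this does not call BioPerl at all.

```perl
use strict;
use warnings;

# Hypothetical exception class, standing in for Bio::Root::NotImplemented.
package My::NotImplemented;
sub new  { my ($class, $text) = @_; return bless { text => $text }, $class }
sub text { $_[0]{text} }

# Hypothetical interface: its methods only complain when not overridden.
package My::ParserI;
sub new { my ($class) = @_; return bless {}, $class }
sub throw_not_implemented {
    my ($self) = @_;
    my $method = (caller(1))[3];    # fully qualified name of the stub
    die My::NotImplemented->new(
        "Abstract method $method not implemented by " . ref($self));
}
sub next_result { $_[0]->throw_not_implemented }

# An implementation that forgot to override next_result().
package My::LazyParser;
our @ISA = ('My::ParserI');

package main;

eval { My::LazyParser->new->next_result };
if (ref($@) && $@->isa('My::NotImplemented')) {
    print $@->text, "\n";
}
```

The caller can then tell a missing implementation apart from any other failure by testing the exception class rather than matching on the message text.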
Chris Fields wrote:
> I suppose you could; Bio::Root::Root does that using Error.pm (if it
> is installed). It almost sounds like what Bio::Root::Root does is
> what you want, but you want a little more information when exceptions
> are thrown maybe?
>
> from perldoc Bio::Root::Root:
>
> ...
> # Alternatively, using the new typed exception syntax in
> the throw() call:
>
> $obj->throw( -class => 'Bio::Root::BadParameter',
> -text => "Can not open file $file",
> -value => $file);
> ...
>
> Typed Exception Syntax
>
> The typed exception syntax of throw() has the advantage of
> plainly
> indicating the nature of the trouble, since the name of the
> class is
> included in the title of the exception output.
>
> To take advantage of this capability, you must specify
> arguments as
> named parameters in the throw() call. Here are the parameters:
>
> -class
> name of the class of the exception. This should be one
> of the
> classes defined in Bio::Root::Exception, or a custom
> error of yours
> that extends one of the exceptions defined in
> Bio::Root::Exception.
>
> -text
> a sensible message for the exception
>
> -value
> the value causing the exception or $!, if appropriate.
>
> Note that Bio::Root::Exception does not need to be imported
> into your
> module (or script) namespace in order to throw exceptions via
> Bio::Root::Root::throw(), since Bio::Root::Root imports it.
>
>
> Chris
>
> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote:
>
>
>> Dear all,
>>
>> would it be possible to overload Bio::Root::RootI's 'throw' method to
>> accept an additional, optional (positional) argument to define the
>> exception class, e.g. using Exception::Class:
>>
>> # ...somewhere ...
>>
>> sub makefh {
>> my ( $self, $filename ) = @_;
>> open my $fh, '<' $filename or $self->throw("Can't open file: $!",
>> 'Bio::Exceptions::FileIO'); # NOTE second argument
>> return $fh;
>> }
>>
>> #.... somewhere else
>> my $fh;
>> eval {
>> $fh = $obj->makefh( 'data.txt');
>> }
>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
>> # something's wrong with the file?
>> }
>>
>> --
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Rutger Vos, PhD. candidate
>> Department of Biological Sciences
>> Simon Fraser University
>> 8888 University Drive
>> Burnaby, BC, V5A1S6
>> Phone: 604-291-5625
>> Fax: 604-291-3496
>> Personal site: http://www.sfu.ca/~rvosa
>> FAB* lab: http://www.sfu.ca/~fabstar
>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
>
--
++++++++++++++++++++++++++++++++++++++++++++++++++++
Rutger Vos, PhD. candidate
Department of Biological Sciences
Simon Fraser University
8888 University Drive
Burnaby, BC, V5A1S6
Phone: 604-291-5625
Fax: 604-291-3496
Personal site: http://www.sfu.ca/~rvosa
FAB* lab: http://www.sfu.ca/~fabstar
Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
++++++++++++++++++++++++++++++++++++++++++++++++++++
From hlapp at gmx.net Tue Jul 11 15:05:03 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 11 Jul 2006 15:05:03 -0400
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <44B36846.8070103@sfu.ca>
References: <44B36846.8070103@sfu.ca>
Message-ID: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net>
I think it does this already, except that I believe you need to
create the exception object and initialize with the message upfront.
Steve, can you comment? Is this at least somewhat right?
-hilmar
On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote:
> Dear all,
>
> would it be possible to overload Bio::Root::RootI's 'throw' method to
> accept an additional, optional (positional) argument to define the
> exception class, e.g. using Exception::Class:
>
> # ...somewhere ...
>
> sub makefh {
> my ( $self, $filename ) = @_;
> open my $fh, '<' $filename or $self->throw("Can't open file: $!",
> 'Bio::Exceptions::FileIO'); # NOTE second argument
> return $fh;
> }
>
> #.... somewhere else
> my $fh;
> eval {
> $fh = $obj->makefh( 'data.txt');
> }
> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
> # something's wrong with the file?
> }
>
> --
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
> Rutger Vos, PhD. candidate
> Department of Biological Sciences
> Simon Fraser University
> 8888 University Drive
> Burnaby, BC, V5A1S6
> Phone: 604-291-5625
> Fax: 604-291-3496
> Personal site: http://www.sfu.ca/~rvosa
> FAB* lab: http://www.sfu.ca/~fabstar
> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Tue Jul 11 15:05:54 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 11 Jul 2006 15:05:54 -0400
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
References: <44B36846.8070103@sfu.ca>
<954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
Message-ID: <297D4770-A963-4039-8D90-987CC570BA94@gmx.net>
Alright - well spotted Chris. This is what I was looking for.
On Jul 11, 2006, at 9:13 AM, Chris Fields wrote:
> I suppose you could; Bio::Root::Root does that using Error.pm (if it
> is installed). It almost sounds like what Bio::Root::Root does is
> what you want, but you want a little more information when exceptions
> are thrown maybe?
>
> from perldoc Bio::Root::Root:
>
> ...
> # Alternatively, using the new typed exception syntax in
> the throw() call:
>
> $obj->throw( -class => 'Bio::Root::BadParameter',
> -text => "Can not open file $file",
> -value => $file);
> ...
>
> Typed Exception Syntax
>
> The typed exception syntax of throw() has the advantage of
> plainly
> indicating the nature of the trouble, since the name of the
> class is
> included in the title of the exception output.
>
> To take advantage of this capability, you must specify
> arguments as
> named parameters in the throw() call. Here are the parameters:
>
> -class
> name of the class of the exception. This should be one
> of the
> classes defined in Bio::Root::Exception, or a custom
> error of yours
> that extends one of the exceptions defined in
> Bio::Root::Exception.
>
> -text
> a sensible message for the exception
>
> -value
> the value causing the exception or $!, if appropriate.
>
> Note that Bio::Root::Exception does not need to be imported
> into your
> module (or script) namespace in order to throw exceptions via
> Bio::Root::Root::throw(), since Bio::Root::Root imports it.
>
>
> Chris
>
> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote:
>
>> Dear all,
>>
>> would it be possible to overload Bio::Root::RootI's 'throw' method to
>> accept an additional, optional (positional) argument to define the
>> exception class, e.g. using Exception::Class:
>>
>> # ...somewhere ...
>>
>> sub makefh {
>> my ( $self, $filename ) = @_;
>> open my $fh, '<' $filename or $self->throw("Can't open file: $!",
>> 'Bio::Exceptions::FileIO'); # NOTE second argument
>> return $fh;
>> }
>>
>> #.... somewhere else
>> my $fh;
>> eval {
>> $fh = $obj->makefh( 'data.txt');
>> }
>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
>> # something's wrong with the file?
>> }
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Tue Jul 11 16:42:35 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 15:42:35 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <44B3F4D1.7090804@sfu.ca>
Message-ID: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
Bio::Root::Root doesn't overload throw_not_implemented from
Bio::Root::RootI; from the comments, it looks like Steve C and Ewan B couldn't
work out some of the Error.pm issues.
Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't
accept arguments; it throws a Bio::Root::NotImplemented exception
automatically.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Rutger Vos
> Sent: Tuesday, July 11, 2006 1:58 PM
> To: Chris Fields
> Cc: 'Bioperl List'
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> I must have overlooked this. I think it does what I want. So could I do
> something like:
>
> $obj->thow_not_implemented( -class => 'Bio::Root::NotImplemented' );
>
> ...in interfaces?
From frederick.partridge at st-johns.oxford.ac.uk Tue Jul 11 17:23:28 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Tue, 11 Jul 2006 22:23:28 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from
genpept
Message-ID:
I am trying to retrieve various protein sequences from genpept using
get_Seq_by_acc. All of them work ok, except one T16005:
If I try and retrieve it with a reduced program:
#!/usr/bin/perl -w
use strict;
use Bio::Perl;
use Bio::SeqIO;
use Bio::DB::GenPept;
my $genpept = Bio::DB::GenPept->new;
my $seq = $genpept->get_Seq_by_acc('T16005');
print $seq->seq(), "\n";
I get back a nucleotide sequence, which is another sequence at NCBI with
the same accession number. (I thought these were meant to be unique? but
evidently not.)
I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3
Could anyone help me to get this protein sequence with my program?
Many thanks,
Freddie Partridge
University of Oxford
From qfdong at iastate.edu Tue Jul 11 17:32:56 2006
From: qfdong at iastate.edu (Qunfeng)
Date: Tue, 11 Jul 2006 16:32:56 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from
genpept
In-Reply-To:
References:
Message-ID: <6.1.2.0.2.20060711163128.08086570@qfdong.mail.iastate.edu>
This particular protein record (acc#T16005) was imported from PIR. In other
words, this is not an original GenBank protein record. When GenBank imports
protein records from other DB, it keeps their original acc#.
However, gi# should be unique.
Q
From cjfields at uiuc.edu Tue Jul 11 18:05:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 17:05:09 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from
genpept
In-Reply-To:
Message-ID: <000001c6a536$141befb0$15327e82@pyrimidine>
It's an imported PIR record, so there probably is no accession recorded in
the database. I think NCBI uses a fallback to nucleotide if it can't find a
particular accession via protein. Using the primary ID (the GI#, 7498730)
works.
Chris
From cjfields at uiuc.edu Tue Jul 11 18:47:38 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 17:47:38 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from
genpept
In-Reply-To: <000001c6a536$141befb0$15327e82@pyrimidine>
Message-ID: <000201c6a53c$03970ed0$15327e82@pyrimidine>
Okay, now try this:
use Bio::DB::GenPept;
use Bio::SeqIO;
my $factory = Bio::DB::GenPept->new(-format => 'fasta');
my $seqin = $factory->get_Stream_by_acc('T16005');
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
-format => 'fasta');
while (my $seq = $seqin->next_seq) {
$seqout->write_seq($seq);
}
This returns both the nucleotide sequence and the correct protein sequence;
the protein was returned second for some reason, so get_Seq_by_acc misses it
while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
they will likely just tell me to use GI numbers for searches, as they are
unique. Probably a good warning for anyone using accessions for all their
work (I use the GI myself).
Chris
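
A minimal sketch of one way to cope with the mixed stream described above:
scan the records and keep the first one whose alphabet is 'protein'. The
helper name first_protein and the Sketch::* stand-in classes are hypothetical
and exist only so the example runs without BioPerl or network access. With
BioPerl, the stream would presumably come from
$factory->get_Stream_by_acc('T16005') and the records would be real sequence
objects, whose alphabet() method reports 'dna', 'rna' or 'protein'.

```perl
use strict;
use warnings;

# Return the first record in a sequence stream whose alphabet is 'protein'.
# $stream is anything with a next_seq() method, e.g. the stream returned by
# get_Stream_by_acc(). Returns undef if no protein record is found.
sub first_protein {
    my ($stream) = @_;
    while ( my $seq = $stream->next_seq ) {
        return $seq if $seq->alphabet eq 'protein';
    }
    return;
}

# --- Hypothetical stand-ins (NOT BioPerl classes) for an offline demo ---
package Sketch::Seq;
sub new      { my ( $class, %args ) = @_; return bless {%args}, $class }
sub alphabet { $_[0]{alphabet} }
sub seq      { $_[0]{seq} }

package Sketch::Stream;
sub new      { my ( $class, @seqs ) = @_; return bless { queue => [@seqs] }, $class }
sub next_seq { shift @{ $_[0]{queue} } }

package main;

# Nucleotide record first, protein record second, as in the case above.
my $stream = Sketch::Stream->new(
    Sketch::Seq->new( alphabet => 'dna',     seq => 'ATGGCC' ),
    Sketch::Seq->new( alphabet => 'protein', seq => 'MAKL' ),
);

my $protein = first_protein($stream);
print $protein->seq, "\n" if $protein;
```

Filtering on the alphabet avoids relying on the order in which the server
returns records, which is exactly what tripped up get_Seq_by_acc here.
Fetching by GI remains the more robust option.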
From Steve_Chervitz at affymetrix.com Tue Jul 11 20:21:16 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Tue, 11 Jul 2006 17:21:16 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net>
Message-ID:
The Bio::Root::Root object is rigged to use the Error.pm module if
available, so you can throw and catch exception objects derived from
Error. The motivation here was to provide a recommended path for folks who
want to use more structured exception handling logic in their bioperl code.
There are a number of predefined subclasses of exceptions that cover common
problems (such as FileOpenException), but you can also define your own. See
a list of the predefined exceptions as well as some how-to docs in the POD
for Bio::Root::Exception:
http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/Exception.pm
There's a bunch more info about Bioperl exception fun available from the
bioperl distribution under the examples/root directory. See the README in
that directory to get oriented. There are a number of demo scripts there,
too.
Bio::Root::Root doesn't know anything about Exception::Class, but I see you
can use it with Error.pm as described here:
http://search.cpan.org/~drolsky/Exception-Class-1.23/lib/Exception/Class.pm#OTHER_EXCEPTION_MODULES_(try%2Fcatch_syntax)
Cheers,
Steve
> From: Hilmar Lapp
> Date: Tue, 11 Jul 2006 15:05:03 -0400
> To: Rutger Vos
> Cc: Bioperl , Steve Chervitz
>
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> I think it does this already, except that I believe you need to
> create the exception object and initialize with the message upfront.
>
> Steve, can you comment? Is this at least somewhat right?
>
> -hilmar
From Steve_Chervitz at affymetrix.com Tue Jul 11 21:07:06 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Tue, 11 Jul 2006 18:07:06 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
Message-ID: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
On Jul 11, 2006, at 1:42 PM, Chris Fields wrote:
> Bio::Root::Root doesn't overload throw_not_implemented from
> Bio::Root::RootI; from the comments looks like Steve C and Ewan B
> couldn't
> work out some of the Error.pm issues.
The issue (I believe) was that
Bio::Root::RootI::throw_not_implemented was doing some checking for
the presence of Error.pm and calling Error::throw. I changed it so
that this fanciness only happens in Root.pm.
> Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't
> accept arguments; it throws a Bio::Root::NotImplemented exception
> automatically.
Looking at the code now, throw_not_implemented() does not throw a
Bio::Root::NotImplemented exception. It just throws a simple,
unclassed message. We could allow it to throw an exception of class
Bio::Root::NotImplemented by changing this code:
if( $self->can('throw') ) {
$self->throw($message);
}...
to this
if( $self->can('throw') ) {
$self->throw(-text=>$message, -class=>'Bio::Root::NotImplemented');
}...
This does not create any dependency on Error.pm, but permits it to be
used if available. If Error.pm is not loaded, the only change is that
the class string is included in the error message, which is kind of
handy.
Trouble would occur if the implementing class:
* does not derive from Bio::Root::Root,
* does not import Bio::Root::Exception,
* fails to implement a method which gets called, and
* Error.pm is available.
I don't know if such implementations exist in bioperl now, but I
suspect they would be rare (and discouraged).
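
To illustrate what the proposed change would buy a caller, here is a
self-contained sketch of the pattern: an interface method calls
throw_not_implemented(), which throws a typed exception object that the
caller can test with isa(). All Sketch::* packages are hypothetical stand-ins
written for this example; they are not Bio::Root::Root, Bio::Root::RootI or
Error.pm, and they only mimic the -class/-text calling convention quoted
earlier in the thread.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# --- Hypothetical stand-ins; NOT the real BioPerl root classes ---
package Sketch::Exception;
sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
sub text { $_[0]{text} }

package Sketch::NotImplemented;
our @ISA = ('Sketch::Exception');

package Sketch::Root;
sub new { my $class = shift; return bless {}, $class }

# Accepts either a plain message or -class/-text named parameters.
sub throw {
    my ( $self, @args ) = @_;
    if ( @args >= 2 && $args[0] =~ /^-/ ) {
        my %p     = @args;
        my $class = $p{-class} || 'Sketch::Exception';
        die $class->new( text => $p{-text} );    # typed: an object
    }
    die "$args[0]\n";                            # untyped: a plain string
}

sub throw_not_implemented {
    my ($self) = @_;
    my $method = ( caller(1) )[3];    # name of the unimplemented method
    $self->throw(
        -class => 'Sketch::NotImplemented',
        -text  => "Abstract method \"$method\" is not implemented",
    );
}

# An "interface" whose method has no implementation yet.
package Sketch::AlignerI;
our @ISA = ('Sketch::Root');
sub align { $_[0]->throw_not_implemented }

package main;

my $obj = Sketch::AlignerI->new;
my $ok  = eval { $obj->align; 1 };
if ( !$ok ) {
    my $err = $@;
    if ( blessed($err) && $err->isa('Sketch::NotImplemented') ) {
        print "caught typed exception: ", $err->text, "\n";
    }
    else {
        print "caught untyped error: $err";
    }
}
```

The blessed() check matters because an untyped throw leaves a plain string in
$@, and calling isa() on a string that is not a class name would itself die.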
Steve
From cjfields at uiuc.edu Tue Jul 11 23:27:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 22:27:37 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
<337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
Message-ID: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
Makes sense to keep most of the magic in Root instead of RootI.pm.
The POD for RootI does state that the class exception thrown is
Bio::Root::NotImplemented, so we should probably either change the
POD to reflect what really happens or change throw_not_implemented
like you suggest (my vote is the latter). I don't think many (if
any) implementing classes fall into your 'trouble' category, though I
can't be sure how many actually import Bio::Root::Exception.
Chris
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From frederick.partridge at st-johns.oxford.ac.uk Wed Jul 12 11:16:33 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Wed, 12 Jul 2006 16:16:33 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting
proteinfromgenpept
In-Reply-To: <000201c6a53c$03970ed0$15327e82@pyrimidine>
References: <000201c6a53c$03970ed0$15327e82@pyrimidine>
Message-ID:
On Tue, 11 Jul 2006, Chris Fields wrote:
> This returns both the nucleotide sequence and the correct protein sequence;
> the protein was returned second for some reason, so get_Seq_by_acc misses it
> while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
> they will likely just tell me to use the GI number for searches as they are
> unique. Probably a good warning for anyone using accessions for all their
> work (I use the GI myself).
Thank you both for your help, I have converted to GIs and it works much
better.
As an aside, it might be nice to have a $hit->gi method as well as
$hit->accession for parsing blast reports. (I now realise that you can
derive the gi from $hit->name, but this might have encouraged me to start
off using gi instead of accession numbers).
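A minimal sketch of deriving the GI from a hit name, assuming the name is an NCBI-style pipe-delimited identifier. The helper name gi_from_name and the sample identifier are made up for illustration; neither is part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the GI out of an identifier such as
# 'gi|12345|ref|XX_000000.1|'. Returns undef when no GI is present.
sub gi_from_name {
    my ($name) = @_;
    return $name =~ /\bgi\|(\d+)/ ? $1 : undef;
}

# Made-up identifier, not a real record.
my $gi = gi_from_name('gi|12345|ref|XX_000000.1|');
print defined $gi ? "GI: $gi\n" : "no GI in name\n";
```

In a SearchIO loop this could presumably be called as gi_from_name($hit->name), falling back to the accession when it returns undef.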
Freddie Partridge
University of Oxford
From cjfields at uiuc.edu Wed Jul 12 11:39:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 10:39:39 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting
proteinfromgenpept
In-Reply-To:
Message-ID: <000b01c6a5c9$635a7540$15327e82@pyrimidine>
Problem is, you may or may not have GIs for a BLAST hit depending on how you
retrieve the BLAST report, what interface you use, etc. NCBI is pretty
ambiguous when it comes to GI vs. accession; the sequence database guys want
you to use the GI for searches (since that's the unique ID for NCBI's
databases) and don't promise getting the correct sequence using the
accession.
However, the BLAST interface guys have set up the BLAST CGI server to not
return the GI by default (accessible through Bio::Tools::Run::RemoteBlast).
Even more confusing, if you use the NCBI BLAST web interface, this option is
turned on by default. Don't know what blastcl3 or blastall does, haven't
checked in a while.
Anyway, this could be why there is no $hit->gi method for
GenericHit/BlastHit. It could be added; I will need to look at
SearchIO::blast/blastxml/blasttable to see how this is parsed out.
BTW, what I do as a work-around, when using RemoteBlast, is below (you could
use the while loop to grab the GIs using SearchIO::blast if they are present
in the BLAST report). This grabs all the GI's from the description line
(not just the best hit).
# sets retrieval header to include the GI always
$Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{'NCBI_GI'} = 'yes';
...
my @gis;    # collects every GI found in the hit descriptions
while ( my $hit = $result->next_hit ) {
    my $description = $hit->description;
    # each "gi|<id>|" token in the description line yields one GI
    while ( $description =~ /gi\|(.*?)\|/g ) {
        push @gis, $1;
    }
}
Chris
From Steve_Chervitz at affymetrix.com Wed Jul 12 14:53:22 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Wed, 12 Jul 2006 11:53:22 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
<337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
<18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
Message-ID: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>
For modules that derive from Bio::Root::Root, there's no need to
import Bio::Root::Exception since the Root object does it.
I also favor adding the -class parameter to throw_not_implemented in
RootI. I just committed this change in bioperl-live. I also added
a test for it in t/RootI.t.
I haven't run the complete suite of tests after making this change,
but I don't suspect there'll be any trouble (famous last words).
Really, if any test leads to the calling of throw_not_implemented
(besides the test I just added), that in itself is trouble.
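To illustrate what a classed "not implemented" exception buys the caller, here is a self-contained sketch in plain Perl. It is not the RootI code: My::NotImplemented and My::InterfaceI are hypothetical stand-ins, and the throw() shown here only mimics the named-parameter style (-class, -text) discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical exception class; stands in for something like
# Bio::Root::NotImplemented but is not part of BioPerl.
package My::NotImplemented;
sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub text { return $_[0]{-text} }

# Hypothetical interface with one abstract method.
package My::InterfaceI;
sub new { return bless {}, shift }

# Dies with an object of the requested class instead of a plain string.
sub throw {
    my ( $self, %args ) = @_;
    die $args{-class}->new(%args);
}

sub throw_not_implemented {
    my $self = shift;
    # caller(1) describes the frame that called this sub,
    # i.e. the abstract method that was invoked.
    my $method = ( caller(1) )[3];
    $self->throw(
        -class => 'My::NotImplemented',
        -text  => "Abstract method \"$method\" is not implemented",
    );
}

sub next_seq { $_[0]->throw_not_implemented }

package main;

my $obj = My::InterfaceI->new;
my $ok  = eval { $obj->next_seq; 1 };
my $err = $@;

# The class of the exception, not the message text, decides the handling.
my $caught = ( !$ok && ref $err && $err->isa('My::NotImplemented') ) ? 1 : 0;
print "caught: ", $err->text, "\n" if $caught;
```

The point of the design is that callers can test isa() on the exception rather than pattern-matching the message string.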
Steve
On Jul 11, 2006, at 8:27 PM, Chris Fields wrote:
> Makes sense to keep most of the magic in Root instead of RootI.pm.
> The POD for RootI does state that the class exception thrown is
> Bio::Root::NotImplemented, so we should probably either change the
> POD to reflect what really happens or change throw_not_implemented
> like you suggest (my vote is the latter). I don't think many (if
> any) implementing classes fall into your 'trouble' category, though I
> can't be sure how many actually import Bio::Root::Exception.
>
> Chris
>
> On Jul 11, 2006, at 8:07 PM, Steve_Chervitz wrote:
>
>> On Jul 11, 2006, at 1:42 PM, Chris Fields wrote:
>>
>>> Bio::Root::Root doesn't overload throw_not_implemented from
>>> Bio::Root::RootI; from the comments looks like Steve C and Ewan B
>>> couldn't
>>> work out some of the Error.pm issues.
>>
>> The issue (I believe) was that
>> Bio::Root::RootI::throw_not_implemented was doing some checking for
>> the presence of Error.pm and calling Error::throw. I changed it so
>> that this fanciness only happens in Root.pm.
>>
>>> Judging by the POD for Bio::Root::RootI, throw_not_implemented
>>> doesn't
>>> accept arguments; it throws a Bio::Root::NotImplemented exception
>>> automatically.
>>
>> Looking at the code now, throw_not_implemented() does not throw a
>> Bio::Root::NotImplemented exception. It just throws a simple,
>> unclassed message. We could allow it to throw an exception of class
>> Bio::Root::NotImplemented by changing this code:
>>
>> if( $self->can('throw') ) {
>>     $self->throw($message);
>> }...
>>
>> to this
>>
>> if( $self->can('throw') ) {
>>     $self->throw(-text=>$message,
>>                  -class=>'Bio::Root::NotImplemented');
>> }...
>>
>> This does not create any dependency on Error.pm, but permits it to
>> be used if available. If Error.pm is not loaded, the only change is
>> that the class string is included in the error message, which is
>> kind of handy.
>>
>> Trouble would occur if the implementing class:
>>
>> * does not derive from Bio::Root::Root,
>> * does not import Bio::Root::Exception,
>> * fails to implement a method which gets called, and
>> * Error.pm is available.
>>
>> I don't know if such implementations exist in bioperl now, but I
>> suspect they would be rare (and discouraged).
>>
>> Steve
>>
>>
>>> Chris
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>> bounces at lists.open-bio.org] On Behalf Of Rutger Vos
>>>> Sent: Tuesday, July 11, 2006 1:58 PM
>>>> To: Chris Fields
>>>> Cc: 'Bioperl List'
>>>> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class)
>>>> overloading?
>>>>
>>>> I must have overlooked this. I think it does what I want. So could I do
>>>> something like:
>>>>
>>>> $obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );
>>>>
>>>> ...in interfaces?
>>>>
>>>> Chris Fields wrote:
>>>>> I suppose you could; Bio::Root::Root does that using Error.pm (if it
>>>>> is installed). It almost sounds like what Bio::Root::Root does is
>>>>> what you want, but you want a little more information when exceptions
>>>>> are thrown maybe?
>>>>>
>>>>> from perldoc Bio::Root::Root:
>>>>>
>>>>> ...
>>>>> # Alternatively, using the new typed exception syntax in the throw() call:
>>>>>
>>>>> $obj->throw( -class => 'Bio::Root::BadParameter',
>>>>>              -text  => "Can not open file $file",
>>>>>              -value => $file);
>>>>> ...
>>>>>
>>>>> Typed Exception Syntax
>>>>>
>>>>> The typed exception syntax of throw() has the advantage of plainly
>>>>> indicating the nature of the trouble, since the name of the class is
>>>>> included in the title of the exception output.
>>>>>
>>>>> To take advantage of this capability, you must specify arguments as
>>>>> named parameters in the throw() call. Here are the parameters:
>>>>>
>>>>> -class
>>>>>     name of the class of the exception. This should be one of the
>>>>>     classes defined in Bio::Root::Exception, or a custom error of yours
>>>>>     that extends one of the exceptions defined in Bio::Root::Exception.
>>>>>
>>>>> -text
>>>>>     a sensible message for the exception
>>>>>
>>>>> -value
>>>>>     the value causing the exception or $!, if appropriate.
>>>>>
>>>>> Note that Bio::Root::Exception does not need to be imported into your
>>>>> module (or script) namespace in order to throw exceptions via
>>>>> Bio::Root::Root::throw(), since Bio::Root::Root imports it.
>>>>>
>>>>>
>>>>> Chris
>>>>>
>>>>> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote:
>>>>>
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> would it be possible to overload Bio::Root::RootI's 'throw' method to
>>>>>> accept an additional, optional (positional) argument to define the
>>>>>> exception class, e.g. using Exception::Class:
>>>>>>
>>>>>> # ...somewhere ...
>>>>>>
>>>>>> sub makefh {
>>>>>>     my ( $self, $filename ) = @_;
>>>>>>     open my $fh, '<', $filename
>>>>>>         or $self->throw("Can't open file: $!",
>>>>>>                         'Bio::Exceptions::FileIO'); # NOTE second argument
>>>>>>     return $fh;
>>>>>> }
>>>>>>
>>>>>> #.... somewhere else
>>>>>> my $fh;
>>>>>> eval {
>>>>>>     $fh = $obj->makefh( 'data.txt');
>>>>>> };
>>>>>> if ( ref $@ and $@->isa('Bio::Exceptions::FileIO') ) {
>>>>>>     # something's wrong with the file?
>>>>>> }
>>>>>>
>>>>>> --
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Rutger Vos, PhD. candidate
>>>>>> Department of Biological Sciences
>>>>>> Simon Fraser University
>>>>>> 8888 University Drive
>>>>>> Burnaby, BC, V5A1S6
>>>>>> Phone: 604-291-5625
>>>>>> Fax: 604-291-3496
>>>>>> Personal site: http://www.sfu.ca/~rvosa
>>>>>> FAB* lab: http://www.sfu.ca/~fabstar
>>>>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
From cjfields at uiuc.edu Wed Jul 12 15:23:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 14:23:33 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>
Message-ID: <000901c6a5e8$aaca53e0$15327e82@pyrimidine>
Thanks Steve!
Chris
From dsche at uga.edu Thu Jul 13 14:55:03 2006
From: dsche at uga.edu (Dongsheng Che)
Date: Thu, 13 Jul 2006 14:55:03 -0400 (EDT)
Subject: [Bioperl-l] remoteBlast problem
Message-ID: <20060713145503.CIV61560@punts2.cc.uga.edu>
To whom it may concern:
I'm trying to do a BLAST search remotely, so I downloaded bioperl-1.5 and followed the installation procedure, i.e. perl Makefile.PL, make, make test, make install. I know there were some failures during the installation.
Since my main purpose is to get RemoteBlast working, I don't want to bother figuring out all the failures. But when I ran the remote BLAST example from bptutorial, it gave me some errors:
-------------------------------------------------------------
Beginning run_remoteblast example...
Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303.
**Warning**: Couldn't connect to NCBI with Bio::Tools::Run::StandAloneBlast.pm!
Probably no network access.
Skipping Test
----------------------------------------------------------------
I am wondering what causes the problem.
Thanks in advance!
Dongsheng
From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:39:19 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:39:19 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
Message-ID: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca>
Hello Again,
I have another question regarding Remote blast but this time using Genome Blast.
Here is the link:
http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
which again uses the main Blast web site:
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
Again I am not sure what to add or what HEADER information to change within my
script.
Here is my program, which was the same as the last email:
#!/usr/bin/perl -w
use Bio::Perl;
use Bio::Tools::Run::RemoteBlast;
my $prog = "blastn";
my $db = "refseq_genomic";
my $e_val = 0.01;
my @params = ( '-prog' => $prog,
'-data' => $db,
'-expect' => $e_val);
my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
# Question: what do I put here?
$Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';
# Question: do I need to add any other values to the form inputs?
#$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';
$factory->submit_blast("blast.in");
my $v = 1;
while (my @rids = $factory->each_rid)
{ foreach my $rid ( @rids )
{ my $rc = $factory->retrieve_blast($rid);
if( !ref($rc) )
{ if( $rc < 0 )
{ $factory->remove_rid($rid);
}
print STDERR "." if ( $v > 0 );
sleep 5;
}
else
{ my $result = $rc->next_result();
my $filename = $result->query_name()."\.out";
$factory->save_output($filename);
$factory->remove_rid($rid);
print "\nQuery Name: ", $result->query_name(), "\n";
}
}
}
Both of my questions are very similar, in that I know how to use remote blast but
am not sure what to change to access the specific blast I want.
Again, any help would be very appreciated!!
Rohan
----------------------------------------
This mail sent through www.mywaterloo.ca
From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:31:38 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:31:38 -0400
Subject: [Bioperl-l] Remote Blast - SNP data base
Message-ID: <1152829898.44b6c9cab7a3a@www.nexusmail.uwaterloo.ca>
Hello,
1. I was wondering if anyone knew how to use SNP Blast via the Remote Blast
module?? Basically I want to blast my sequence against the dbSNP database and
you can normally do this through NCBI's website:
http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi
The site basically takes your info and submits it to the main blast site:
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
I am just not sure what settings to change within my script. I have something
like this:
#!/usr/bin/perl -w
use Bio::Perl;
use Bio::Tools::Run::RemoteBlast;
my $prog = "blastn";
my $db = "refseq_genomic";   # <--- What db should I use??
my $e_val = 0.01;
my @params = ( '-prog' => $prog,
'-data' => $db,
'-expect' => $e_val);
my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
$factory->submit_blast("blast.in");   # <--- Name of my file in fasta format
my $v = 1;
while (my @rids = $factory->each_rid)
{ foreach my $rid ( @rids )
{ my $rc = $factory->retrieve_blast($rid);
if( !ref($rc) )
{ if( $rc < 0 )
{ $factory->remove_rid($rid);
}
print STDERR "." if ( $v > 0 );
sleep 5;
}
else
{ my $result = $rc->next_result();
my $filename = $result->query_name()."\.out";
$factory->save_output($filename);
$factory->remove_rid($rid);
print "\nQuery Name: ", $result->query_name(), "\n";
}
}
}
I think something like this should be added to have the correct form inputs but
I am unsure:
$Bio::Tools::Run::RemoteBlast::HEADER{'???'} = '????';
Any help on this topic would greatly be appreciated!!
Rohan
From cjfields at uiuc.edu Thu Jul 13 20:42:57 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 13 Jul 2006 19:42:57 -0500
Subject: [Bioperl-l] remoteBlast problem
In-Reply-To: <20060713145503.CIV61560@punts2.cc.uga.edu>
Message-ID: <000401c6a6de$737fe570$15327e82@pyrimidine>
1) Before I get wound up in the obvious here, you need to upgrade to CVS;
RemoteBlast and SearchIO::blast were fixed after v1.5.1 (i.e. in CVS) to
account for changes in BLAST output at NCBI.
2) The Bio::Tools::Run::StandAloneBlast.pm bit worried me a little, so I
did a little digging; that's a typo. Now corrected in CVS, along with some
BPLite cruft left over.
3) Speaking bluntly? Come on. The error is stated as plainly as possible.
No? How about this (note the arrows):
-----------> **Warning**: Couldn't connect to NCBI with
-----------> Bio::Tools::Run::StandAloneBlast.pm!
-----------> Probably no network access.
Skipping Test
Check your network connections, preferably AFTER you update to CVS. It's
possible that it's a proxy issue, but that should also be fixed in CVS.
Chris
From cjfields at uiuc.edu Thu Jul 13 21:56:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 13 Jul 2006 20:56:30 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca>
Message-ID: <000501c6a6e8$b9c24d20$15327e82@pyrimidine>
I added a method to RemoteBlast in bioperl-live (CVS) if you want to play
with changing the URL. I have been thinking about doing this for a bit now
but I already see problems.
Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note
the differences in the URL) but a user-friendly request page, generated on
the fly by Genome, to submit BLAST requests for the relevant database. So
changing the URL will not work (even by adding extra parameters); you only
get the original HTML web page.
You could try changing the database or limiting the search using an Entrez
term (which you should be able to include in the request, probably by adding
it to the HEADER).
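A rough sketch of the Entrez-limit idea, under two assumptions: that ENTREZ_QUERY is the form field that carries the limit, and that 'Homo sapiens [ORGN]' is a suitable term (it is only an example). The %limits hash is a name introduced here for illustration. The header is only modified if Bio::Tools::Run::RemoteBlast can be loaded, and nothing is submitted to NCBI.

```perl
use strict;
use warnings;

# Assumption: 'ENTREZ_QUERY' is the request field for an Entrez limit,
# and the organism term below is just an example value.
my %limits = ( ENTREZ_QUERY => 'Homo sapiens [ORGN]' );

# Only touch the RemoteBlast header if BioPerl is actually installed.
if ( eval { require Bio::Tools::Run::RemoteBlast; 1 } ) {
    no warnings 'once';
    $Bio::Tools::Run::RemoteBlast::HEADER{$_} = $limits{$_} for keys %limits;
    print "HEADER now limits the search to: $limits{ENTREZ_QUERY}\n";
}
else {
    print "Bio::Tools::Run::RemoteBlast not found; nothing changed\n";
}
```

The header would need to be set before submit_blast is called, since it is the submission request that carries the form fields.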
Chris
From smart_bioit at yahoo.com Fri Jul 14 13:25:51 2006
From: smart_bioit at yahoo.com (raj sharma)
Date: Fri, 14 Jul 2006 10:25:51 -0700 (PDT)
Subject: [Bioperl-l] advice
Message-ID: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
i have one problem in perl
i want to make one program which whn run online
can download required data from data bank to local server
frm where i shld start
From charlesh at stedwards.edu Sat Jul 15 15:29:46 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sat, 15 Jul 2006 14:29:46 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
Message-ID: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
All,
I'm trying to determine where (the start .. end positions) within a
genomic scaffold sequence gaps occur.
The gaps are denoted as runs of N's.
Suggestions on how to easily retrieve this would be appreciated.
ch
From cjfields at uiuc.edu Sat Jul 15 17:22:15 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 15 Jul 2006 16:22:15 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <000001c6a854$bee47400$15327e82@pyrimidine>
You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if
the format is set to 'gb' (it is now set to 'gbwithparts' by default). The
CONTIG lines are currently stored in a series of
Bio::Annotation::SimpleValue objects; get the accessions using the following
script.
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
my $factory = Bio::DB::GenBank->new(-format => 'gb');
# the accession/ID to fetch is taken from the command line
my $seq = $factory->get_Seq_by_id(shift);
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
                             -format => 'genbank');
# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', grep {$_->tagname eq 'CONTIG'}
    $seq->get_Annotations();
# split each region, getting rid of gaps and join(), then split into acc/span
for (grep {$_ !~ m{gap|join}}
     split ',', $contig) {
    my ($acc, $span) = split ':', $_;
    $span =~ s{\)}{}g; # spurious ')'
    print "ACC: $acc\n\tSpan:$span\n";
}
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
>
> All,
>
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
>
> Suggestions on how to easily retrieve this would be appreciated.
>
> ch
>
>
From sudhaneti at yahoo.com Sat Jul 15 15:26:01 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sat, 15 Jul 2006 12:26:01 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
Message-ID: <20060715192601.36517.qmail@web53315.mail.yahoo.com>
Being a beginner at Perl programming, I was wondering if anyone can help me with the implementation of the BLOSUM 65 matrix for the following alignments, or in general. Any input or websites to help with this would be appreciated.
AILCAA
ALLLAA
ILIICL
Thanks
Sudha
From charlesh at stedwards.edu Sun Jul 16 19:32:38 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sun, 16 Jul 2006 18:32:38 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <000001c6a854$bee47400$15327e82@pyrimidine>
References: <000001c6a854$bee47400$15327e82@pyrimidine>
Message-ID:
Hi Chris,
Thanks for the info.
Unfortunately, I was not clear that the sequence is unannotated, i.e.
there is no GenBank record. I need to extract the locations of the
gaps from a raw fasta file.
ch
On Jul 15, 2006, at 4:22 PM, Chris Fields wrote:
> You can retrieve the original GenBank CONTIG file using
> Bio::DB::GenBank if
> the format is set to 'gb' (it is now set to 'gbwithparts' by
> default. The
> CONTIG lines are currently stored in a series of
> Bio::Annotation::SimpleValue objects; get the accessions using the
> following
> script.
>
> use strict;
> use warnings;
>
> use Bio::DB::GenBank;
>
> my $factory = Bio::DB::GenBank->new(-format => 'gb');
>
> my $seq = $factory->get_Seq_by_id(shift);
>
> my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
> -format => 'genbank');
>
> # greps only annotations with CONTIG tagname, joins all together
> my $contig = join '', grep {$_->tagname eq 'CONTIG'}
> $seq->get_Annotations();
>
> # split each region, getting rid of gaps and join(), then split into
> acc/span
> for (grep {$_ !~ m{gap|join}}
> split ',', $contig) {
> my ($acc, $span) = split ':', $_;
> $span =~ s{\)}{}g; # spurious ')'
> print "ACC: $acc\n\tSpan:$span\n";
> }
>
>
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
>> Sent: Saturday, July 15, 2006 2:30 PM
>> To: bioperl-l at lists.open-bio.org
>> Subject: [Bioperl-l] Finding locations of a string within a fasta
>> file
>>
>> All,
>>
>> I'm trying to determine where (the start .. end positions) within a
>> genomic scaffold sequence gaps occur.
>> The gaps are denoted as runs of N's.
>>
>> Suggestions on how to easily retrieve this would be appreciated.
>>
>> ch
>>
>>
>
From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:23:51 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:23:51 +1000
Subject: [Bioperl-l] advice
In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
References: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
Message-ID: <44BAF4B7.8090508@infotech.monash.edu.au>
raj sharma wrote:
> i have one problem in perl
is this Bio::Perl related?
> i want to make one program which whn run online
do you mean runs on a web server as a CGI script, or access on-line data?
> can download required data from data bank to local server
which databank - genbank or ... ?
> frm where i shld start
http://www.oreilly.com/catalog/lperl3/
--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:21:31 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:21:31 +1000
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
References: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <44BAF42B.8080102@infotech.monash.edu.au>
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> Suggestions on how to easily retrieve this would be appreciated.
First you need to get the sequence into a string within Perl. As your
email Subject: says it is in the Fasta file, you need to
1. open the fasta file - see Bio::SeqIO
2. read first sequence (as an object) - see next_seq()
3. get the string of the sequence in the object - see seq()
Then you could just use the built-in Perl function index() to loop
through all the occurrences of 'N' - type 'perldoc -f index' for help.
Alternatively, use regexp matching, e.g. m/(N+)/g, and the pos() function.
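A minimal sketch of the regexp approach, assuming the sequence is already in a
plain Perl string (for a Fasta file it would come from Bio::SeqIO via
next_seq() and seq(), as in the steps above). The helper name find_gaps is
made up for illustration, and matching lower-case 'n' as well is an
assumption. It uses the built-in @- and @+ arrays rather than pos() to get
both ends of each match.

```perl
use strict;
use warnings;

# Return a list of [start, end] pairs (1-based, inclusive),
# one pair for every run of N (or n) in the sequence string.
sub find_gaps {
    my ($seq) = @_;
    my @gaps;
    while ($seq =~ /N+/gi) {
        # $-[0] is the 0-based offset where the match starts,
        # $+[0] is the 0-based offset just past its end.
        push @gaps, [ $-[0] + 1, $+[0] ];
    }
    return @gaps;
}

# Toy sequence standing in for a real scaffold.
my $seq = 'ACGTNNNNACGTNNAC';
printf "gap: %d..%d\n", @$_ for find_gaps($seq);
```

For a multi-sequence file the same function could be called once per sequence
inside a while (my $seqobj = $in->next_seq) loop.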
--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
From sudhaneti at yahoo.com Sun Jul 16 22:33:20 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sun, 16 Jul 2006 19:33:20 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <44BAF316.9020301@infotech.monash.edu.au>
Message-ID: <20060717023320.6402.qmail@web53313.mail.yahoo.com>
Sorry for not being clear with my question. Let me try to explain. I want to implement dynamic programming using BLOSUM as the scoring matrix.
1. I want to know how to define the values of BLOSUM in an array.
2. What functions are suitable for global alignment of two sequences, etc.
As a beginner programmer, I would like some direction, books, and good websites that can help me achieve the implementation. It would be great if someone could walk me through this.
Thanks
Sudha
Torsten Seemann wrote:
Sudha,
> Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in
general. Any inputs, websites to help with this are appreciated.
> AILCAA
> ALLLAA
> ILIICL
The BLOSUM65 matrix does not define a method for alignment, it just
provides some parameters. Perhaps you should read this first:
http://en.wikipedia.org/wiki/Sequence_alignment
--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:16:54 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:16:54 +1000
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <20060715192601.36517.qmail@web53315.mail.yahoo.com>
References: <20060715192601.36517.qmail@web53315.mail.yahoo.com>
Message-ID: <44BAF316.9020301@infotech.monash.edu.au>
Sudha,
> Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in
general. Any inputs, websites to help with this are appreciated.
> AILCAA
> ALLLAA
> ILIICL
The BLOSUM65 matrix does not define a method for alignment, it just
provides some parameters. Perhaps you should read this first:
http://en.wikipedia.org/wiki/Sequence_alignment
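To illustrate the distinction between a scoring matrix and an alignment
method, here is a rough sketch in plain Perl (no BioPerl). The matrix is only
a lookup table of parameters; the method shown is a Needleman-Wunsch style
global alignment score. Everything here is an assumption for illustration:
the scores are the BLOSUM62 entries for A, C, I and L only (not BLOSUM65, and
not a full matrix), the linear gap penalty of -4 is arbitrary, and nw_score
is a made-up name.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Substitution scores as a hash of hashes: $score{X}{Y}.
# Illustrative subset only (BLOSUM62 entries for A, C, I, L).
my %score = (
    A => { A =>  4, C =>  0, I => -1, L => -1 },
    C => { A =>  0, C =>  9, I => -1, L => -1 },
    I => { A => -1, C => -1, I =>  4, L =>  2 },
    L => { A => -1, C => -1, I =>  2, L =>  4 },
);
my $gap = -4;    # assumed linear gap penalty

# Global alignment score of two strings by dynamic programming,
# keeping only the previous row of the table.
sub nw_score {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    # Row 0: aligning a prefix of $t against nothing costs only gaps.
    my @prev = map { $_ * $gap } 0 .. scalar(@t);
    for my $i (1 .. scalar(@s)) {
        my @cur = ($i * $gap);
        for my $j (1 .. scalar(@t)) {
            push @cur, max(
                $prev[$j - 1] + $score{ $s[$i - 1] }{ $t[$j - 1] },  # match/mismatch
                $prev[$j] + $gap,                                    # gap in $t
                $cur[$j - 1] + $gap,                                 # gap in $s
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print nw_score('AILCAA', 'ALLLAA'), "\n";
```

A hash of hashes avoids having to map residues to array indices; recovering
the alignment itself, rather than just its score, would need the full table
and a traceback.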
--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
From smart_bioit at yahoo.com Mon Jul 17 00:21:41 2006
From: smart_bioit at yahoo.com (raj sharma)
Date: Sun, 16 Jul 2006 21:21:41 -0700 (PDT)
Subject: [Bioperl-l] advice
In-Reply-To: <44BAF4B7.8090508@infotech.monash.edu.au>
Message-ID: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
hi trston
well here i will make u clear my problem
i want to make one data base of marine species u can say this as mirror of data
so at present whn i click there on line data base of ncbi gets open
so i want to dowload data of marine species (ny one)
nd whn ever i click on tht link local data which i have downloaded shld open
nd data shld also b updated online after some time
waiting for ur reply
From cjfields at uiuc.edu Mon Jul 17 00:51:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 16 Jul 2006 23:51:20 -0500
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <20060717023320.6402.qmail@web53313.mail.yahoo.com>
References: <20060717023320.6402.qmail@web53313.mail.yahoo.com>
Message-ID:
Hmm, beginner programmer, wants to learn perl? Here are some
directions:
http://learn.perl.org/
Start with Schwartz's latest incarnation of Learning Perl, then work
your way up to Intermediate Perl (I think Mastering Perl is on the
horizon...)
For some pointers using Perl and bioinformatics, pick up Tisdall's
books Beginning/Mastering Perl for Bioinformatics.
This is really a list for bioperl, not perl and bioinformatics
(though the two cross here all the time!). We normally don't mind
answering questions but we typically don't do people's homework
unless we're unusually bored. And we can be excessively cranky when
someone repeatedly posts requests for something that shouldn't take
much reading and Googling to find out. Again, we're not into that
homework gig, i.e. 'walking you through it' is tantamount to 'doing
it for you.'
1) Arrays and how to use them are in Learning Perl; there are
probably better ways to do this than an array, though...
2) Use Torsten's link to get you started.
Chris
On Jul 16, 2006, at 9:33 PM, Sudha Gunturu wrote:
> Sorry for not being clear with my question. Let me try to
> explain. I want to Implement dynamic programing using Blosum as
> scoring matrix.
>
> 1. I want to know how to define the values of Blosum in an array.
> 2. What functions are suitable for global alignment of two
> sequences. Etc.,
>
> Being a beginer programer want some direction, books, and good
> websites which can help me in achieving the implementation. It
> would be great if someone can walk me through this.
>
> Thanks
> Sudha
>
> Torsten Seemann wrote:
> Sudha,
>
>> Being a beginner perl programming, was wondering if anyone can
>> help me with implementation of BLOSUM 65 matrix for the following
>> alignments or in
> general. Any inputs, websites to help with this are appreciated.
>> AILCAA
>> ALLLAA
>> ILIICL
>
> The BLOSUM65 matrix does not define a method for alignment, it just
> provides some parameters. Perhaps you should read this first:
>
> http://en.wikipedia.org/wiki/Sequence_alignment
>
> --
> Dr Torsten Seemann http://www.vicbioinformatics.com
> Victorian Bioinformatics Consortium, Monash University, Australia
>
>
>
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Mon Jul 17 01:01:53 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 00:01:53 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
References: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
Message-ID: <82C51420-A18B-4DEA-A519-CE1D7B9C7B10@uiuc.edu>
This is a Bioperl list. If you don't have a Bioperl-related
question, you will very likely get testy replies.
I don't believe that you quite understand Torsten's response, so I'll
just copy-and-paste from a reply I just gave a second ago to save
myself the typing:
Hmm, beginner programmer, wants to learn perl? Here are some
directions:
http://learn.perl.org/
Start with Schwartz's latest incarnation of Learning Perl, then work
your way up to Intermediate Perl (I think Mastering Perl is on the
horizon...)
For some pointers using Perl and bioinformatics, pick up Tisdall's
books Beginning/Mastering Perl for Bioinformatics.
This is really a list for bioperl, not perl and bioinformatics
(though the two cross here all the time!). We normally don't mind
answering questions but we typically don't do people's homework
unless we're unusually bored. And we can be excessively cranky when
someone repeatedly posts requests for something that shouldn't take
much reading and Googling to find out. Again, we're not into that
homework gig, i.e. 'walking you through it' is tantamount to 'doing
it for you.'
For your particular instance, you might want to brush up on web
services, CGI, and a little web etiquette.
http://catb.org/esr/faqs/smart-questions.html
I think you may be waiting for a long time for a reply!
Chris
On Jul 16, 2006, at 11:21 PM, raj sharma wrote:
>
> hi trston
> well here i will make u clear my problem
>
> i want to make one data base of marine species u can say this as
> mirror of data
>
>
> so at present whn i click there on line data base of ncbi gets open
>
> so i want to dowload data of marine species (ny one)
> nd whn ever i click on tht link local data which i have
> downloaded shld open
> nd data shld also b updated online after some time
>
> waiting for ur reply
>
>
>
>
>
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bmoore at genetics.utah.edu Mon Jul 17 01:25:32 2006
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Sun, 16 Jul 2006 23:25:32 -0600
Subject: [Bioperl-l] advice
In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
Message-ID:
By reading this:
http://catb.org/esr/faqs/smart-questions.html
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
Sent: Friday, July 14, 2006 11:26 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] advice
i have one problem in perl
i want to make one program which whn run online
can download required data from data bank to local server
frm where i shld start
From bmoore at genetics.utah.edu Mon Jul 17 01:34:58 2006
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Sun, 16 Jul 2006 23:34:58 -0600
Subject: [Bioperl-l] advice
In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
Message-ID:
If you're on a Unix-type system, look at wget --mirror and its
variations.
B
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
Sent: Sunday, July 16, 2006 10:22 PM
To: Torsten Seemann
Subject: Re: [Bioperl-l] advice
hi trston
well here i will make u clear my problem
i want to make one data base of marine species u can say this as
mirror of data
so at present whn i click there on line data base of ncbi gets open
so i want to dowload data of marine species (ny one)
nd whn ever i click on tht link local data which i have downloaded
shld open
nd data shld also b updated online after some time
waiting for ur reply
From bix at sendu.me.uk Mon Jul 17 10:32:13 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 15:32:13 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <44ACCCD5.3030309@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk>
<44ACCCD5.3030309@sendu.me.uk>
Message-ID: <44BB9F6D.10005@sendu.me.uk>
Sendu Bala wrote:
> Sendu Bala wrote:
>> The reimplementation will make Position central to the model, allowing
>> for lots of other things to work properly without anything becoming
>> inconsistent (as is currently the case).
>
> This is now done. It uses a new PositionHandler class behind the scenes.
>
> The next step is to introduce relative positioning across the board
This is now done. It uses a new Relative class to describe what a given
position is relative to.
I also made Bio::Map::MapI an AnnotatableI and SimpleMap an implementor.
I think this pretty much brings an end to my changes to Bio::Map. Unless
anyone thinks the changes lack sanity, I think the API of the new things
should be somewhat stable.
> possibly in a way that makes OrderedPosition redundant or an implementer
> of the system.
I haven't yet touched the other kinds of Positions to update/remove
them. Docs in general could probably do with an update/ improvement. I
plan to do this 'soon'.
From golharam at umdnj.edu Mon Jul 17 10:13:20 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 17 Jul 2006 10:13:20 -0400
Subject: [Bioperl-l] advice
In-Reply-To:
Message-ID: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
I apologize that this is off-topic, but it is an interesting email.
Notice the lack of vowels (whn, ny, nd, shld, b) however in other
words, the vowels are clearly included.
Am I getting old or is "internet spelling" starting to differ from
"english spelling"? Or is it that the younger generation (not that I'm
old...a mere 32 is not old) is using shorthand for frequently used words?
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore
Sent: Monday, July 17, 2006 1:35 AM
To: raj sharma
Cc: bioperl-l
Subject: Re: [Bioperl-l] advice
If you're on a unix type system look at wget -mirror and it's
variations.
B
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
Sent: Sunday, July 16, 2006 10:22 PM
To: Torsten Seemann
Subject: Re: [Bioperl-l] advice
hi trston
well here i will make u clear my problem
i want to make one data base of marine species u can say this as
mirror of data
so at present whn i click there on line data base of ncbi gets open
so i want to dowload data of marine species (ny one)
nd whn ever i click on tht link local data which i have downloaded
shld open
nd data shld also b updated online after some time
waiting for ur reply
From arareko at campus.iztacala.unam.mx Mon Jul 17 11:31:09 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Mon, 17 Jul 2006 10:31:09 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
References: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <44BBAD3D.2040203@campus.iztacala.unam.mx>
Maybe it's a new "obscure" perl6 syntax :)
Ryan Golhar wrote:
> I apologize that this is off-topic, but it is an interesting email.
> Notice the lack of vowels (whn, ny, nd, shld, b) however in other
> words, the vowels are clearly included.
>
> Am I getting old or is "internet spelling" starting to differ from
> "english spelling"? Or is it that the younger generation (not that I'm
> old...a mere 32 is not old), using shorthand for frequently used words?
>
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore
> Sent: Monday, July 17, 2006 1:35 AM
> To: raj sharma
> Cc: bioperl-l
> Subject: Re: [Bioperl-l] advice
>
>
> If you're on a unix type system look at wget -mirror and it's
> variations.
>
> B
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
> Sent: Sunday, July 16, 2006 10:22 PM
> To: Torsten Seemann
> Subject: Re: [Bioperl-l] advice
>
>
> hi trston
> well here i will make u clear my problem
>
> i want to make one data base of marine species u can say this as
> mirror of data
>
>
> so at present whn i click there on line data base of ncbi gets open
>
> so i want to dowload data of marine species (ny one)
> nd whn ever i click on tht link local data which i have downloaded
> shld open
> nd data shld also b updated online after some time
>
> waiting for ur reply
>
>
>
>
>
>
>
--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM
From cjfields at uiuc.edu Mon Jul 17 12:09:27 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 11:09:27 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <000a01c6a9bb$6478ba90$15327e82@pyrimidine>
Ha ! I *almost* added something about that. I thought his vowel keys were
broken for a bit, maybe from pounding the keyboard with extreme frustration!
As an aside, doesn't Damian Conway say something about the non-use of vowels
in 'Perl Best Practices?' I think it was in relation to variables,
though...
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Ryan Golhar
> Sent: Monday, July 17, 2006 9:13 AM
> To: 'bioperl-l'
> Subject: Re: [Bioperl-l] advice
>
> I apologize that this is off-topic, but it is an interesting email.
> Notice the lack of vowels (whn, ny, nd, shld, b) however in other
> words, the vowels are clearly included.
>
> Am I getting old or is "internet spelling" starting to differ from
> "english spelling"? Or is it that the younger generation (not that I'm
> old...a mere 32 is not old), using shorthand for frequently used words?
>
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore
> Sent: Monday, July 17, 2006 1:35 AM
> To: raj sharma
> Cc: bioperl-l
> Subject: Re: [Bioperl-l] advice
>
>
> If you're on a unix type system look at wget -mirror and it's
> variations.
>
> B
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
> Sent: Sunday, July 16, 2006 10:22 PM
> To: Torsten Seemann
> Subject: Re: [Bioperl-l] advice
>
>
> hi trston
> well here i will make u clear my problem
>
> i want to make one data base of marine species u can say this as
> mirror of data
>
>
> so at present whn i click there on line data base of ncbi gets open
>
> so i want to dowload data of marine species (ny one)
> nd whn ever i click on tht link local data which i have downloaded
> shld open
> nd data shld also b updated online after some time
>
> waiting for ur reply
>
>
>
>
>
>
From bix at sendu.me.uk Mon Jul 17 12:31:37 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 17:31:37 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
Message-ID: <44BBBB69.6000906@sendu.me.uk>
I see strange node names via Bio::DB::Taxonomy::flatfile:
use Bio::DB::Taxonomy;
my $db = Bio::DB::Taxonomy->new(-source    => 'flatfile',
                                -directory => $taxonomy_dir,
                                -nodesfile => $taxonomy_dir.'nodes.dmp',
                                -namesfile => $taxonomy_dir.'names.dmp');
my $tax_id = 89593;
my $node = $db->get_Taxonomy_Node($tax_id);
print "node $tax_id has name '", @{$node->name('common')},
      "' and rank '", $node->rank, "'\n";
Results in:
node 89593 has name 'Craniata ' and rank 'subphylum'
Other examples:
node 2 has name 'Bacteria ' and rank 'superkingdom'
node 1386 has name 'Bacillus ' and rank 'genus'
node 7776 has name 'Gnathostomata ' and rank 'superclass'
etc.
For me the bits in <> are inappropriate and shouldn't be there. The NCBI
website agrees, and you won't see these things if you use -source =>
'entrez'. Should they be removed by the flatfile parser as a matter of
course, with no warnings or option? Or do people want them? Typically
they are just the name of the parent node, so I don't see why anyone
would /need/ them, and I argue it's invalid for parent node information
to be duplicated here.
If there are no objections I'll strip the <> bits. I also plan to make
$node->name('scientific', 'sapiens'); set and get the node name, and
have flatfile and entrez store all common names with
$obj->name('common', 'human', 'man');. As these changes will make the
implementation match the docs I don't see any problems, except that
flatfile users will now find the node name in a different place
(@{$node->name('scientific')} instead of @{$node->name('common')}).
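As a rough sketch of what stripping the <> bits could look like (this is not
the actual flatfile parser code; the helper name strip_qualifier and the
sample name are hypothetical), a single substitution on the name string is
enough, assuming the qualifier always sits at the end of the name:

```perl
use strict;
use warnings;

# Remove a trailing "<...>" qualifier, plus the whitespace around it,
# from a node name. Names without a qualifier are returned unchanged.
sub strip_qualifier {
    my ($name) = @_;
    $name =~ s/\s*<[^>]*>\s*$//;
    return $name;
}

# Hypothetical name of the form "name <parent>".
print strip_qualifier('Genusname <parentname>'), "\n";
```

Anchoring the pattern at the end of the string keeps any angle brackets that
might appear earlier in a name untouched.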
I'll also fix the problem with node names for ranks species and lower,
as discussed in thread 'Bio::DB::Taxonomy:: mishandles species,
subspecies/variant names', in the way I suggested there.
If anyone can see a problem with any of these changes, let me know asap.
From hlapp at gmx.net Mon Jul 17 13:53:17 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 17 Jul 2006 13:53:17 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBBB69.6000906@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
Message-ID: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
Sounds good to me.
BTW NCBI guarantees (well, promises) that there will only be one node
name of class 'scientific'.
-hilmar
On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote:
> I see strange node names via Bio::DB::Taxonomy::flatfile:
>
> use Bio::DB::Taxonomy;
>
> my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory =>
> $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile =>
> $taxonomy_dir.'names.dmp');
>
> my $tax_id = 89593;
> my $node = $db->get_Taxonomy_Node($tax_id);
>
> print "node $tax_id has name '", @{$node->name('common')}, "' and rank
> '", $node->rank, "'\n";
>
> Results in:
> node 89593 has name 'Craniata ' and rank 'subphylum'
>
> Other examples:
> node 2 has name 'Bacteria ' and rank 'superkingdom'
> node 1386 has name 'Bacillus ' and rank 'genus'
> node 7776 has name 'Gnathostomata ' and rank 'superclass'
> etc.
>
> For me the bits in <> are inappropriate and shouldn't be there. The
> NCBI
> website agrees, and you won't see these things if you use -source =>
> 'entrez'. Should they be removed by the flatfile parser as a matter of
> course, with no warnings or option? Or do people want them? Typically
> they are just the name of the parent node, so I don't see why anyone
> would /need/ them, and I argue it's invalid for parent node
> information
> to be duplicated here.
>
> If there are no objections I'll strip the <> bits. I also plan to make
> $node->name('scientific', 'sapiens'); set and get the node name, and
> have flatfile and entrez store all common names with
> $obj->name('common', 'human', 'man');. As these changes will make the
> implementation match the docs I don't see any problems, except that
> flatfile users will now find the node name in a different place
> (@{$node->name('scientific')} instead of @{$node->name('common')}).
>
> I'll also fix the problem with node names for ranks species and lower,
> as discussed in thread 'Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names', in the way I suggested there.
>
> If anyone can see a problem with any of these changes, let me know
> asap.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Mon Jul 17 14:31:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 13:31:08 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
Message-ID: <001d01c6a9cf$2cf50f60$15327e82@pyrimidine>
I agree. It would be nice to get this to play well with weird bacterial names!
I plan on doing some behind-the-scenes work on Bio::DB::Taxonomy::entrez at
some point soon to test out Bio::DB::EUtilities as the user agent; it
currently uses Bio::Root::HTTPget, I think. The reason I'm doing this is to
quickly get tax info based on any primary ID, mainly for grabbing the related
taxonomy information from a sequence GI without parsing the sequence for the
TaxID; this uses NCBI's ELink, which I've now implemented.
I'll make sure everything passes tests before I commit.
Chris
From bix at sendu.me.uk Mon Jul 17 14:09:44 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 19:09:44 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
References: <44BBBB69.6000906@sendu.me.uk>
<8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
Message-ID: <44BBD268.2060308@sendu.me.uk>
Hilmar Lapp wrote:
>> I also plan to make $node->name('scientific', 'sapiens'); set and
>> get the node name, [...] users will now find the node name in [...]
>> @{$node->name('scientific')}
>
> BTW NCBI guarantees (well, promises) that there will only be one node
> name of class 'scientific'.
Yes, which is why I feel the API for name() isn't ideal, but thought it
would be best to play along. Would having a new scientific_name() method
be better, which gets/sets a single value? Perhaps it could just be a
more 'sane' shorthand to setting @{$node->name('scientific')} to a list
with only the supplied name, and getting ${$node->name('scientific')}[0] ?
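A minimal sketch of the shorthand being proposed here, assuming a toy class rather than the real Bio::Taxonomy::Node. Toy::Node and its internal 'names' hash are invented for illustration; only the name() and scientific_name() call shapes come from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for a taxonomy node; this is NOT Bio::Taxonomy::Node.
package Toy::Node;

sub new {
    my ($class) = @_;
    my $self = { names => {} };
    return bless $self, $class;
}

# name($class)           returns an array ref of the names in that class.
# name($class, @values)  replaces the names in that class first.
sub name {
    my ($self, $class, @values) = @_;
    $self->{names}{$class} = [@values] if @values;
    return $self->{names}{$class} || [];
}

# Shorthand: single-valued getter/setter layered on name('scientific').
sub scientific_name {
    my ($self, $value) = @_;
    $self->name('scientific', $value) if defined $value;
    return $self->name('scientific')->[0];
}

package main;

my $node = Toy::Node->new;
$node->scientific_name('sapiens');        # value from the example in this thread
$node->name('common', 'human', 'man');

print "scientific: ", $node->scientific_name, "\n";
print "common: ", join(', ', @{ $node->name('common') }), "\n";
```

Because the shorthand only wraps name(), code that already reads @{$node->name('scientific')} would keep seeing a one-element list.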
From hlapp at gmx.net Mon Jul 17 15:31:55 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 17 Jul 2006 15:31:55 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBD268.2060308@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
<8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
<44BBD268.2060308@sendu.me.uk>
Message-ID: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net>
Yes I think $node->scientific_name() as shorthand would be good to
have. Same BTW for $node->common_names() (which would return an array).
-hilmar
On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>>> I also plan to make $node->name('scientific', 'sapiens'); set and
>>> get the node name, [...] users will now find the node name in [...]
>>> @{$node->name('scientific')}
>>
>> BTW NCBI guarantees (well, promises) that there will only be one node
>> name of class 'scientific'.
>
> Yes, which is why I feel the API for name() isn't ideal, but
> thought it
> would be best to play along. Would having a new scientific_name()
> method
> be better, which gets/sets a single value? Perhaps it could just be a
> more 'sane' shorthand to setting @{$node->name('scientific')} to a
> list
> with only the supplied name, and getting ${$node->name
> ('scientific')}[0] ?
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Mon Jul 17 16:44:18 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 15:44:18 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net>
Message-ID: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
There was some interest in getting Bio::Species to delegate to
Bio::Taxonomy::Node, so having scientific_name() would help quite a bit
since the name used on the ORGANISM line is the scientific name (well, is
supposed to be; famous last words). Don't know about SwissProt, EMBL, and
others though...
Chris
From vrramnar at student.cs.uwaterloo.ca Mon Jul 17 16:46:32 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Mon, 17 Jul 2006 16:46:32 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To: <000501c6a6e8$b9c24d20$15327e82@pyrimidine>
References: <000501c6a6e8$b9c24d20$15327e82@pyrimidine>
Message-ID: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca>
Hi Chris,
1. I have tried changing the database to snp or dbSNP but neither works. It
seems that depending on which type of blast you use (i.e., Genome Blast, Blast SNP,
normal blast such as blastn, etc.) you see a different listing of databases
available for queries. Since you mention that the Blast page I see was generated
by Genome, where could I go to see a complete listing of the databases I can query?
Or do you know offhand which database to search if I only want dbSNP hits?
2. You also mention that I can limit the search by using Entrez terms. Do you mean
like:
$Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc';
where 'abc' is the name of the subject you would like to restrict the
results to? For example, if you set it to 'Homo sapiens[Organism]' then only human
sequences would be in the hit lists.
If this is what you mean, what would I change it to in order to see only hits from
dbSNP?
Thanks for the ongoing help,
Rohan
Quoting Chris Fields:
> I added a method to RemoteBlast in bioperl-live (CVS) if you want to play
> with changing the URL. I have been thinking about doing this for a bit now
> but I already see problems.
>
> Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note
> the differences in the URL) but a user-friendly request page, generated on
> the fly by Genome, to submit BLAST requests for the relevant database. So
> changing the URL will not work (even by adding extra parameters); you only
> get the original HTML web page.
>
> You could try changing the database or limiting the search using an Entrez
> term (which you should be able to include in the request, probably by adding
> it to the HEADER).
>
> Chris
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca
> > Sent: Thursday, July 13, 2006 5:39 PM
> > To: bioperl-l at lists.open-bio.org
> > Subject: [Bioperl-l] Remote Blast - Blast Human Genome
> >
> >
> > Hello Again,
> >
> > I have another question regarding Remote blast but this time using Genome
> > Blast.
> >
> > Here is the link:
> >
> > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
> >
> > which again uses the main Blast web site:
> >
> > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
> >
> > Again I am not sure what to add or what HEADER information to change
> > within my
> > script.
> >
> > Here is my program, which was the same as the last email:
> >
> > #!/usr/bin/perl -w
> >
> > use Bio::Perl;
> > use Bio::Tools::Run::RemoteBlast;
> >
> > my $prog = "blastn";
> > my $db = "refseq_genomic";
> > my $e_val = 0.01;
> >
> > my @params = ( '-prog' => $prog,
> > '-data' => $db,
> > '-expect' => $e_val);
> >
> > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params);
> > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <-----
> > what
> > do I put here
> > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need
> > to add
> > any other values to the form inputs
> >
> > $factory->submit_blast("blast.in");
> > $v = 1;
> >
> > while (my @rids = $factory->each_rid)
> > { foreach my $rid ( @rids )
> > { my $rc = $factory->retrieve_blast($rid);
> > if( !ref($rc) )
> > { if( $rc < 0 )
> > { $factory->remove_rid($rid);
> > }
> > print STDERR "." if ( $v > 0 );
> > sleep 5;
> > }
> > else
> > { my $result = $rc->next_result();
> > my $filename = $result->query_name()."\.out";
> > $factory->save_output($filename);
> > $factory->remove_rid($rid);
> > print "\nQuery Name: ", $result->query_name(), "\n";
> > }
> > }
> > }
> >
> >
> > Both of my questions are very similiar as in I know how to use remote
> > blast but
> > not sure what to change to access the specific blast I want.
> >
> > Again, any help would be very appreciated!!
> >
> > Rohan
> >
> >
> >
> > ----------------------------------------
> > This mail sent through www.mywaterloo.ca
>
From cjfields at uiuc.edu Mon Jul 17 17:25:54 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 16:25:54 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca>
Message-ID: <001001c6a9e7$962b56c0$15327e82@pyrimidine>
Okay, I think I know a little more now about what's going on with NCBI's BLAST
interface. It looks like any NCBI BLAST query must use the default URL and so
must set up the proper GET/PUT commands to retrieve everything correctly.
Here's the API description for it all:
http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
You could try setting the database to 'snp' or something along those lines
instead of 'nr'; or you could see what the name of the database is when you
use the web form and try setting it to that. According to this page, this
should be possible:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP_test._Search_dbSNP_Using_B
The Entrez Query limit was a recommendation for limiting your search to a
set of sequences for human, for instance.
I'll try looking into it a bit more but I'm pretty busy. If you find
anything out you should probably post it here.
Chris
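To make the Entrez limit concrete, here is a rough sketch that only builds the query string for a submission request and sends nothing. The parameter names (CMD, PROGRAM, DATABASE, QUERY, EXPECT, ENTREZ_QUERY) are my understanding of the URL API document linked above and should be checked against it; url_escape(), build_put_query() and the placeholder sequence are invented for this example. It does not answer which database name, if any, gives dbSNP hits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
sub url_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Build a query string from key/value pairs (keys sorted for stable output).
# Nothing is sent over the network here.
sub build_put_query {
    my (%param) = @_;
    return join '&',
        map { $_ . '=' . url_escape($param{$_}) } sort keys %param;
}

my $query = build_put_query(
    CMD          => 'Put',
    PROGRAM      => 'blastn',
    DATABASE     => 'refseq_genomic',          # database used in the script in this thread
    EXPECT       => '0.01',
    ENTREZ_QUERY => 'Homo sapiens[Organism]',  # Entrez limit discussed above
    QUERY        => 'ACGTACGTACGT',            # placeholder sequence, not real data
);

print "$query\n";
```

With RemoteBlast itself, the equivalent step is the $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} assignment already quoted in this thread.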
From bix at sendu.me.uk Mon Jul 17 17:33:26 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 22:33:26 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
References: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
Message-ID: <44BC0226.1080605@sendu.me.uk>
Chris Fields wrote:
> There was some interest in getting Bio::Species to delegate to
> Bio::Taxonomy::Node, so having scientific_name() would help quite a bit
> since the name used on the ORGANISM line is the scientific name (well, is
> supposed to be; famous last words).
Can you clarify exactly what you mean here? Preferably with an example?
ORGANISM line of which file format?
The reason I ask is that I still feel we need to do parsing of the names
for species rank and lower:
# The 'scientific name' for humans could be considered to be 'Homo sapiens'.
# Taxid 9606 in the NCBI taxonomy database has rank 'species' and
ScientificName 'Homo sapiens'.
# For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as
'sapiens' so that the genus is not held redundantly. It provides a
binomial() method to give you 'Homo sapiens' again if you want it.
# I plan on maintaining this; scientific_name() would give you the
non-redundant sibling-unique name 'sapiens'. binomial() on a species
rank and lower would give you 'Homo sapiens' (presumably grabbing the
'Homo' from the parent node with rank 'genus', or similar).
Good, bad or ugly? I would prefer it works like this and we agree to
differ with NCBI on what the 'scientific name' of a species node should
be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling
binomial() (which I propose will actually give the correct answer, even
for bacteria and viruses).
Perhaps the short-hand (and the classifier used in name()) shouldn't
mention the word 'scientific' to avoid confusion? But a) what else would
we call it?, and b) for all ranks above species it /is/ the scientific name.
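For what it's worth, the binomial() behaviour proposed here (take the genus from an ancestor node) could look roughly like the sketch below. The nodes are plain hashes and every field and function name is invented for illustration; this is not the existing Bio::Taxonomy code, and it only addresses species-rank nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy nodes as plain hashes; field names are made up for this sketch.
my $genus   = { name => 'Homo',    rank => 'genus',   parent => undef };
my $species = { name => 'sapiens', rank => 'species', parent => $genus };

# Walk up the parent links (starting above the node itself) until a node
# of the wanted rank is found; returns nothing if there is none.
sub ancestor_of_rank {
    my ($node, $rank) = @_;
    for (my $n = $node->{parent}; $n; $n = $n->{parent}) {
        return $n if $n->{rank} eq $rank;
    }
    return;
}

# Genus name from the ancestor plus the node's own non-redundant name.
# Falls back to the node's own name when no genus ancestor exists.
sub binomial {
    my ($node) = @_;
    my $g = ancestor_of_rank($node, 'genus');
    return $g ? "$g->{name} $node->{name}" : $node->{name};
}

print binomial($species), "\n";
```

Nodes below species rank, or names that do not split cleanly into genus and epithet, would need more logic than this.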
From hlapp at gmx.net Mon Jul 17 19:47:24 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 17 Jul 2006 19:47:24 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BC0226.1080605@sendu.me.uk>
References: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
<44BC0226.1080605@sendu.me.uk>
Message-ID: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net>
I don't think we should differ from NCBI in places where the
connection between a method name and the NCBI data file is obvious or
otherwise we will confuse people and send them into traps.
$node->scientific_name() should simply report what NCBI reports. For
simple species this will be identical to what $node->binomial()
returns, but for others it may not, e.g., strains, varieties, etc or
the weird world of viri and bacteria.
This will also absolve us from retaining the business logic for how
to construct the scientific name from genus, species, and possibly
strain or whatever.
binomial() isn't part of the NCBI taxonomy definition, so you have
freedom there to report what suits you.
-hilmar
On Jul 17, 2006, at 5:33 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> There was some interest in getting Bio::Species to delegate to
>> Bio::Taxonomy::Node, so having scientific_name() would help quite
>> a bit
>> since the name used on the ORGANISM line is the scientific name
>> (well, is
>> supposed to be; famous last words).
>
> Can you clarify exactly what you mean here? Preferably with an
> example?
> ORGANISM line of which file format?
> The reason I ask is that I still feel we need to do parsing of the
> names
> for species rank and lower:
>
> # The 'scientific name' for humans could be considered to be 'Homo
> sapiens'.
> # Taxid 9606 in the NCBI taxonomy database has rank 'species' and
> ScientificName 'Homo sapiens'.
> # For sanity, Bio::*Taxonomy* likes to interpret this
> ScientificName as
> 'sapiens' so that the genus is not held redundantly. It provides a
> binomial() method to give you 'Homo sapiens' again if you want it.
> # I plan on maintaining this; scientific_name() would give you the
> non-redundant sibling-unique name 'sapiens'. binomial() on a species
> rank and lower would give you 'Homo sapiens' (presumably grabbing the
> 'Homo' from the parent node with rank 'genus', or similar).
>
> Good, bad or ugly? I would prefer it works like this and we agree to
> differ with NCBI on what the 'scientific name' of a species node
> should
> be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling
> binomial() (which I propose will actually give the correct answer,
> even
> for bacteria and viruses).
>
> Perhaps the short-hand (and the classifier used in name()) shouldn't
> mention the word 'scientific' to avoid confusion? But a) what else
> would
> we call it?, and b) for all ranks above species it /is/ the
> scientific name.
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From osborne1 at optonline.net Mon Jul 17 20:52:04 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Mon, 17 Jul 2006 20:52:04 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BC0226.1080605@sendu.me.uk>
Message-ID:
Sendu,
The string "sapiens" is not what a biology textbook would call a scientific
name. You're going to have to respect decades of convention and have
scientific_name() return the genus and species name.
Brian O.
On 7/17/06 5:33 PM, "Sendu Bala" wrote:
> # I plan on maintaining this; scientific_name() would give you the
> non-redundant sibling-unique name 'sapiens'. binomial() on a species
> rank and lower would give you 'Homo sapiens' (presumably grabbing the
> 'Homo' from the parent node with rank 'genus', or similar).
From cjfields at uiuc.edu Mon Jul 17 21:36:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 20:36:12 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BC0226.1080605@sendu.me.uk>
References: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
<44BC0226.1080605@sendu.me.uk>
Message-ID: <1345AB61-E7AB-447A-AB40-2170244404B2@uiuc.edu>
On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> There was some interest in getting Bio::Species to delegate to
>> Bio::Taxonomy::Node, so having scientific_name() would help quite
>> a bit
>> since the name used on the ORGANISM line is the scientific name
>> (well, is
>> supposed to be; famous last words).
>
> Can you clarify exactly what you mean here? Preferably with an
> example?
> ORGANISM line of which file format?
> The reason I ask is that I still feel we need to do parsing of the
> names
> for species rank and lower:
Sorry, should have clarified; GenBank sequence format. Here's the link:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
The ORGANISM annotation line for a GenBank record contains the formal
scientific name for the organism along with the lineage. I believe
SwissProt/EMBL and several other RichSeq formats do the same. The
lineage that is also present is almost always abbreviated, so it's
not always possible to determine the formal rankings strictly from
the file with any real degree of reliability (hence the past problems
with Bio::Species).
>
> # The 'scientific name' for humans could be considered to be 'Homo
> sapiens'.
> # Taxid 9606 in the NCBI taxonomy database has rank 'species' and
> ScientificName 'Homo sapiens'.
> # For sanity, Bio::*Taxonomy* likes to interpret this
> ScientificName as
> 'sapiens' so that the genus is not held redundantly. It provides a
> binomial() method to give you 'Homo sapiens' again if you want it.
> # I plan on maintaining this; scientific_name() would give you the
> non-redundant sibling-unique name 'sapiens'. binomial() on a species
> rank and lower would give you 'Homo sapiens' (presumably grabbing the
> 'Homo' from the parent node with rank 'genus', or similar).
I think you should use scientific_name to designate the full formal
scientific name for an organism exactly as NCBI describes
it for that particular node (nothing more, except removing the <>
stuff you mentioned earlier) and as it would appear on the ORGANISM
line. Otherwise you'll run into serious species/subspecies/strain
headaches (see below). If you want the real genus/species (i.e. nothing
extra, like strains or subspecies), separate them out and store them
using a genus/species get/set if possible; binomial() will then
give back the two-name genus/species designation.
Here are a few examples (this is XML output retrieved using
EUtilities). These were retrieved by NCBI TaxID, using ELink from
a list of protein GIs (~700 of them in total), so they represent the actual
NCBI TaxID linked with the sequence file. If you try breaking these
apart into species, what happens to the strain/subspecies stuff?
Notice that many of these nodes, which come directly from protein
GIs, also have no rank.
...
376686
Flavobacterium johnsoniae UW101
Flavobacterium johnsoniae NBRC 14942
Flavobacterium johnsoniae IFO 14942
Flavobacterium johnsoniae IAM 14304
Flavobacterium johnsoniae MYX.1.1.1
Flavobacterium johnsoniae NCIB 11054
Flavobacterium johnsoniae DSM 2064
Flavobacterium johnsoniae LMG 1341
Flavobacterium johnsoniae ATCC 17061
Flavobacterium johnsoniae strain UW101
Flavobacterium johnsoniae str. UW101
986
no rank
Bacteria
...
370552
Streptococcus pyogenes MGAS10270
Streptococcus pyogenes strain MGAS10270
Streptococcus pyogenes str. MGAS10270
301448
no rank
Bacteria
...
224308
Bacillus subtilis subsp. subtilis str. 168
Bacillus subtilis subsp. subtilis 168
135461
no rank
Bacteria
> Good, bad or ugly? I would prefer it works like this and we agree to
> differ with NCBI on what the 'scientific name' of a species node
> should
> be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling
> binomial() (which I propose will actually give the correct answer,
> even
> for bacteria and viruses).
This is where I would strongly disagree (though I agree that the way
NCBI uses 'scientific name' is a bit off).
We are using the NCBI tax database, and as such we are somewhat at
the mercy of the NCBI tax nomenclature, unfortunately.
If NCBI decides to change their official definition for the
scientific name to something that made a bit more sense, the XML and
dump data will reflect that and we won't have many problems adapting
since the scientific name will always conform to their definition.
But if we split the information up ad hoc then we are bound for
disaster; it's just way too much headache to worry about. We could
always point to the official NCBI definition as the one we adopt and
then assign the tagged information from the node directly to
scientific_name (no globbing together at all). Bio::Species could
delegate likewise from the ORGANISM line, so there are no piecemeal
attempts to get Humpty Dumpty to fit back together again.
You could go through and get the lineage from the XML/dump file data
and try to sort the genus/species out, then paste it all back
together (fingers crossed!), but I think it's more headache than it's
worth to split these up, then hope that you can paste them back
together again and always expect to get the same results.
Chris
> Perhaps the short-hand (and the classifier used in name()) shouldn't
> mention the word 'scientific' to avoid confusion? But a) what else
> would
> we call it?, and b) for all ranks above species it /is/ the
> scientific name.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Mon Jul 17 21:55:28 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 20:55:28 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References:
Message-ID: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>
I agree with Hilmar's assessment, not b/c I disagree with your
definition of scientific name or the reasoning Sendu proposes. I
think we are somewhat bound to NCBI's nomenclature for their tax
database. If we veer away from NCBI's definition for 'scientific
name' it will just confuse users and lead to more trouble than it's
worth, frankly. If we stick with it then any changes NCBI makes
should be easier to deal with.
Leaving the scientific_name as NCBI designates it, though it probably
disagrees with ~99% of the world's textbooks, may be the most
maintainable solution.
Now, binomial() on the other hand...
Chris
On Jul 17, 2006, at 7:52 PM, Brian Osborne wrote:
> Sendu,
>
> The string "sapiens" is not what a biology textbook would call a
> scientific
> name. You're going to have to respect decades of convention and have
> scientific_name() return the genus and species name.
>
> Brian O.
>
>
> On 7/17/06 5:33 PM, "Sendu Bala" wrote:
>
>> # I plan on maintaining this; scientific_name() would give you the
>> non-redundant sibling-unique name 'sapiens'. binomial() on a species
>> rank and lower would give you 'Homo sapiens' (presumably grabbing the
>> 'Homo' from the parent node with rank 'genus', or similar).
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From hlapp at gmx.net Mon Jul 17 22:06:01 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 17 Jul 2006 22:06:01 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>
References:
<07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>
Message-ID:
On Jul 17, 2006, at 9:55 PM, Chris Fields wrote:
> Leaving the scientific_name as NCBI designates it, though it probably
> disagrees with ~99% of the world's textbooks, may be the most
> maintainable solution.
It doesn't disagree, it's quite like what the world's textbooks give
you as a 'scientific name'.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Tue Jul 18 00:24:50 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 23:24:50 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References:
<07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>
Message-ID: <7BCA093B-90FB-4B0A-91FD-A6E0B34C96DD@uiuc.edu>
If you mean genus-species, then yes. But parent nodes?
If you trust Wikipedia, the scientific name == binomial
nomenclature. Which could mean no subspecies, strains, etc if one
were to be really strict about it, though that may be a grey area;
I'm no taxonomist.
http://en.wikipedia.org/wiki/Scientific_name
The parent nodes shouldn't have a scientific name if one were to
adhere strictly to the standard definition above, but NCBI refers to
the names for the parent nodes as 'scientific name' (the XML element
is still ScientificName, just like the child node). I'm not sure
what the tax dump file is, though, so that may be different. Here's
the lineage for Taxid 312284 (marine actinobacterium PHSC20C1). I
cut out the irrelevant bits and just show the lineage with all the
parent nodes, taxID, and rank:
TaxId    ScientificName                                Rank
131567   cellular organisms                            no rank
2        Bacteria                                      superkingdom
201174   Actinobacteria                                phylum
1760     Actinobacteria (class)                        class
52018    unclassified Actinobacteria                   no rank
78537    unclassified Actinobacteria (miscellaneous)   no rank
....
Seems to me the easiest thing to do here, when looking at a
particular node, is to use scientific_name() to hold that particular
element for the node and have binomial represent the true 'scientific
name', much as Sendu proposed. It would also make life much easier
when parsing GenBank/SwissProt/EMBL (SeqIO) to have the data
designating the formal scientific name (according to NCBI) be
assigned to a scientific_name() get/set method in Bio::Species for
later writing; then if we want to delegate this over to
Bio::Taxonomy::Node from Bio::Species it would be that much easier.
This would also get around some of the problems I have been seeing
with bacterial names when passing GenBank data through SeqIO, since
you wouldn't be required to glop the name together from the way
Bio::Species tried to guess the lineage.
Chris
On Jul 17, 2006, at 9:06 PM, Hilmar Lapp wrote:
>
> On Jul 17, 2006, at 9:55 PM, Chris Fields wrote:
>
>> Leaving the scientific_name as NCBI designates it, though it probably
>> disagrees with ~99% of the world's textbooks, may be the most
>> maintainable solution.
>
> It doesn't disagree, it's quite like what the world's textbooks give
> you as a 'scientific name'.
>
> -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Tue Jul 18 03:27:49 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 08:27:49 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net>
References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk>
<64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net>
Message-ID: <44BC8D75.1080806@sendu.me.uk>
Hilmar Lapp wrote:
> I don't think we should differ from NCBI in places where the
> connection between a method name and the NCBI data file is obvious or
> otherwise we will confuse people and send them into traps.
>
> $node->scientific_name() should simply report what NCBI reports. For
> simple species this will be identical to what $node->binomial()
> returns, but for others it may not, e.g., strains, varieties, etc or
> the weird world of viri and bacteria.
Ok, well this certainly seems to be consensus so I'll abide.
> This will also absolve us from retaining the business logic for how
> to construct the scientific name from genus, species, and possibly
> strain or whatever.
What about the existing genus(), species(), sub_species() and variant()
methods? There would be no need for any logic to join things together,
but I would still like to be able to get just 'sapiens' from somewhere.
Can I use species() for that purpose (though again, species is strictly
'Homo sapiens')? Likewise sub_species() and variant() could hold the
remaining non-redundant names. Or should all of these be deprecated
because they don't really have a place in a generic Node class?
What about node_name()? Yet another synonym of scientific_name? (right
now it grabs the common name(s)). Ugh.
What should I do with the classification array? Should it hold the raw
ScientificName like:
join(',', $node->classification) eq 'Homo sapiens, Homo,
Homo/Pan/Gorilla group [...]'?
Or should it be like:
join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla
group [...]'?
The latter is how it currently works (when it works correctly); I would
rather fix it than lose the logic completely, but if we're staying true
to proper classification (vs. what a programmer might expect), I guess I
must use the raw ScientificName?
> binomial() isn't part of the NCBI taxonomy definition, so you have
> freedom there to report what suits you.
I don't think binomial() would serve any useful purpose now, however. I
can either deprecate it or make it a synonym of scientific_name() or
both. Or binomial() can be a version of scientific_name() that complains
if you use it on a rank higher or lower than species. As for species()
et al., it may have no place in a generic Node class. Thoughts?
From bix at sendu.me.uk Tue Jul 18 04:43:43 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 09:43:43 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBBB69.6000906@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
Message-ID: <44BC9F3F.2040500@sendu.me.uk>
Sendu Bala wrote:
[snip proposed changes to Bio::DB::Taxonomy::* and Bio::Taxonomy::Node]
> If anyone can see a problem with any of these changes, let me know asap.
I've just realised that there are currently no tests for
Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped.
Node doesn't get an especially thorough work-out either (in the skipped
section).
I'm guessing it's not feasible to include the full taxdump from NCBI
(~40MB) in t/data... do people think it would be reasonable to create
some sort of small subset of the data? I could just pull out the lines
from names.dmp and nodes.dmp relevant to a few example organisms. Say,
for human and a tricky bacteria and virus?
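A rough sketch of how such a subset could be pulled out, assuming the usual NCBI dump layout (fields separated by tab-pipe-tab, the first two columns of nodes.dmp being the tax id and the parent tax id, and the root pointing at itself). The sub names and the miniature records are made up for illustration; none of this is existing BioPerl code.

```perl
use strict;
use warnings;

# Split one taxdump line into fields; assumes the "\t|\t" separator
# and the trailing "\t|" terminator of the NCBI dump files.
sub dmp_fields {
    my ($line) = @_;
    $line =~ s/\t\|\s*\z//;
    return split /\t\|\t/, $line, -1;
}

# Given nodes.dmp lines and some starting tax ids, return the set of
# ids needed to keep each lineage intact up to the root.
sub lineage_ids {
    my ($node_lines, @wanted) = @_;
    my %parent_of;
    for my $line (@$node_lines) {
        my ($id, $parent) = dmp_fields($line);
        $parent_of{$id} = $parent;
    }
    my %keep;
    for my $id (@wanted) {
        while (defined $id && !$keep{$id}) {
            $keep{$id} = 1;
            my $parent = $parent_of{$id};
            last if !defined $parent || $parent eq $id;   # root is its own parent
            $id = $parent;
        }
    }
    return \%keep;
}

# Keep only the lines whose first field is one of the wanted ids.
sub subset_lines {
    my ($lines, $keep) = @_;
    return grep { my ($id) = dmp_fields($_); $keep->{$id} } @$lines;
}

# Drastically shortened, made-up input, just to show the shape of the data.
my @nodes = (
    "1\t|\t1\t|\tno rank\t|",
    "9605\t|\t1\t|\tgenus\t|",
    "9606\t|\t9605\t|\tspecies\t|",
    "562\t|\t1\t|\tspecies\t|",
);
my @names = (
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|",
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|",
);

my $keep = lineage_ids(\@nodes, 9606);
print "$_\n" for subset_lines(\@nodes, $keep), subset_lines(\@names, $keep);
```

Walking up the parents first means the subset still forms a connected tree, so a flatfile index built from it should not hit dangling parent ids.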
For the purposes of running the test, where should the index files be
kept? In t/data with the .dmp files or in /tmp? Should the test script
delete them afterwards, or leave them be?
The entrez tests are skipped to 'avoid blocking', but the test only
makes 2 entrez queries with a sleep(3) in-between. Basically, I don't
think there's ever any reason to skip. Shall I remove the skip? Lots of
other database-accessing tests in the test suite just go right ahead and
access their database, no problem.
Cheers,
Sendu.
From torsten.seemann at infotech.monash.edu.au Mon Jul 17 23:53:02 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Tue, 18 Jul 2006 13:53:02 +1000
Subject: [Bioperl-l] advice
In-Reply-To: <000a01c6a9bb$6478ba90$15327e82@pyrimidine>
References: <000a01c6a9bb$6478ba90$15327e82@pyrimidine>
Message-ID: <44BC5B1E.5080600@infotech.monash.edu.au>
> Ha ! I *almost* added something about that. I thought his vowel keys were
> broken for a bit, maybe from pounding the keyboard with extreme frustration!
The wide variety of pronunciation of English around the world can be
mostly blamed on those damned vowels... so perhaps removing them helps
one to reach a wider audience :-)
> As an aside, doesn't Damian Conway say something about the non-use of vowels
> in 'Perl Best Practices?' I think it was in relation to variables,
> though...
Yeah, on page 46 he says NOT to remove vowels in variable names, use
prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff.
(Actually, I studied at Monash University under Damian Conway, and
recall his ridiculing of Perl, so I found it kind of ironic that he
ended up changing the Perl landscape so significantly! He even wrote an
internal publication "theStyle - a guide to C programming style" in
about 1990 in which he violates some of his later Perl Best Practices :-)
--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
From sharma.animesh at gmail.com Tue Jul 18 03:58:41 2006
From: sharma.animesh at gmail.com (Animesh Sharma)
Date: Tue, 18 Jul 2006 13:28:41 +0530
Subject: [Bioperl-l] PDB file parser (Separates chain-sequence and
chain-structure)
Message-ID: <156674e60607180058r653fa8fesbc654508c9c19b5b@mail.gmail.com>
Hi Chris,
I have written a small script to separate the Chain in a PDB file. It stores
the sequence (fasta format) and structure (pdb format) in separate files
with middle name according to the Chain it contains. If the PDB file has
only one chain, it creates a file with default as middle name.
Eg,
perl pdb_chain_extract.pl 1HCO.pdb
Will create 4 files with names:
1HCO.A.fas ( Sequence of Chain A in fasta format)
1HCO.A.pdb ( Structure of Chain A in pdb format)
1HCO.B.fas ( Sequence of Chain B in fasta format)
1HCO.B.pdb ( Structure of Chain B in pdb format)
I wrote it in the spirit of your example script given @
http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/examples/structure/structure-io.pl?rev=1.2&content-type=text/vnd.viewcvs-markup
Can this be included in the example scripts too?
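For readers without the attachment, here is a minimal sketch of the chain-splitting idea only. It is not the attached script: it assumes standard fixed-width PDB records with the chain identifier in column 22, it skips the fasta output entirely, and the sub names and sample records are invented.

```perl
use strict;
use warnings;

# Group ATOM/HETATM records by chain identifier (column 22 of a
# fixed-width PDB record); a blank chain id falls back to "default".
sub split_by_chain {
    my (@lines) = @_;
    my %chain;
    for my $line (@lines) {
        next unless $line =~ /^(?:ATOM  |HETATM)/;
        next if length $line < 22;
        my $id = substr $line, 21, 1;
        $id = 'default' if $id eq ' ';
        push @{ $chain{$id} }, $line;
    }
    return \%chain;
}

# Build a truncated fixed-width ATOM record so the columns line up:
# record name (1-6), serial (7-11), atom name (13-16), residue (18-20),
# chain id (22), residue number (23-26).
sub atom_line {
    my ($serial, $atom, $res, $chain, $resseq) = @_;
    return sprintf 'ATOM  %5d %-4s %3s %1s%4d',
        $serial, $atom, $res, $chain, $resseq;
}

# Made-up records for illustration only.
my @pdb = (
    'REMARK made-up records for illustration',
    atom_line(1, ' N',  'VAL', 'A', 1),
    atom_line(2, ' CA', 'VAL', 'A', 1),
    atom_line(3, ' N',  'VAL', 'B', 1),
    'TER',
);

my $chains = split_by_chain(@pdb);
for my $id (sort keys %$chains) {
    printf "chain %s: %d records\n", $id, scalar @{ $chains->{$id} };
}
```

Each value in the returned hash could then be written to a file named after the chain, as in the 1HCO.A.pdb / 1HCO.B.pdb example above.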
Thanks and regards,
Animesh
--
______________________"The Answer Lies in Genome"______________________
http://fuzzylife.org/animesh/
+919868580004
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdb_chain_extract.pl
Type: application/octet-stream
Size: 2593 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060718/9e98ece2/attachment.obj
From bix at sendu.me.uk Tue Jul 18 09:20:34 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 14:20:34 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BCAE08.8070307@ebi.ac.uk>
References: <44BCAE08.8070307@ebi.ac.uk>
Message-ID: <44BCE022.5000502@sendu.me.uk>
I thought I'd post this here in case anyone wants to discuss the points
Nadeem brings up. As far as I can see it is acceptable to remove the <>
bits so I still plan to do so.
Nadeem Faruque wrote: [off-list, posted here with permission]
> In case you didn't realise, odd node names such as 'Gnathostomata
> ' are created to uniquify some tax nodes that have identical
> scientific names, eg there are 8 entries for Rhodotorula.
>
> When we parse the ncbi tax dump we store this column as UNIQUE_NAME but
> I don't think that we actually use it for anything within the EMBL
> nucleotide sequence bank.
[...]
> Also, I note that there are 548 non-unique NAME_TXT of class 'scientific
> name', so the UNIQUE_NAME column may be of use to someone (though given
> the strength of using a taxid directly I don't see why you'd want to).
Indeed. And given that we are building a taxonomy with nodes, it doesn't
matter that two different nodes in the entire taxonomy tree share the
same name - the position in the tree implicitly is something unique. So
if you find yourself with a node called 'Rhodotorula' you can find out
which one it is by looking at the closest ranked parent.
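To make the "closest ranked parent" idea concrete, a small sketch follows. The tree is a toy: the ids, the unranked intermediate node and the hash layout are invented, and only the two names come from the discussion here.

```perl
use strict;
use warnings;

# Toy tree keyed by made-up ids; each node has a name, a rank and a parent id.
my %node = (
    10 => { name => 'Sporidiobolales',  rank => 'order',   parent => undef },
    11 => { name => 'some unranked group', rank => 'no rank', parent => 10 },
    12 => { name => 'Rhodotorula',      rank => 'genus',   parent => 11 },
);

# Walk up from a node and return the name of the first ancestor whose
# rank is something other than 'no rank'.
sub closest_ranked_parent {
    my ($tree, $id) = @_;
    my $parent = $tree->{$id}{parent};
    while (defined $parent) {
        my $p = $tree->{$parent};
        return $p->{name} if $p->{rank} ne 'no rank';
        $parent = $p->{parent};
    }
    return;   # reached the root without finding a ranked ancestor
}

print closest_ranked_parent(\%node, 12), "\n";
```

Two nodes that share a name would then be told apart by comparing the value this returns for each of them.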
That said, for 'Rhodotorula ' the closest ranked
parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a
problem? Do we need to care about this word 'Sporidiobolaceae' that is
effectively just a synonym of 'Sporidiobolales'?
[Nadeem later replied "...I can't imagine the <> value to be of any
use.". He also clarified that if species have identical names and you
store those, you can't work out what the corresponding taxid is. Without
the <> bit you need some other information, like the classification. I
think this other information will be present in input file formats and
it must be up to the user to store the extra when outputting from bioperl]
From osborne1 at optonline.net Tue Jul 18 10:50:48 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Tue, 18 Jul 2006 10:50:48 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BC9F3F.2040500@sendu.me.uk>
Message-ID:
Sendu,
The idea to create mini *dmp files is a good one, I think. With respect to
temporary files I'm fairly sure that most tests that use them create them
somewhere in t/data and then delete them after.
Brian O.
On 7/18/06 4:43 AM, "Sendu Bala" wrote:
> (~40MB) in t/data... do people think it would be reasonable to create
> some sort of small subset of the data? I could just pull out the lines
> from names.dmp and nodes.dmp relevant to a few example organisms. Say,
> for human and a tricky bacteria and virus?
> For the purposes of running the test, where should the index files be
> kept? In t/data with the .dmp files or in /tmp? Should the test script
> delete them afterwards, or leave them be?
From cjfields at uiuc.edu Tue Jul 18 11:44:07 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 10:44:07 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BC8D75.1080806@sendu.me.uk>
Message-ID: <003201c6aa81$01db9a30$15327e82@pyrimidine>
> What about the existing genus(), species(), sub_species() and variant()
> methods? There would be no need for any logic to join things together,
> but I would still like to be able to get just 'sapiens' from somewhere.
> Can I use species() for that purpose (though again, species is strictly
> 'Homo sapiens')? Likewise sub_species() and variant() could hold the
> remaining non-redundant names. Or should all of these be deprecated
> because they don't really have a place in a generic Node class?
This is where Hilmar suggests that you have a bit of freedom in doing what
you want, as with binomial(). So species() should return species
('sapiens'), genus return genus, etc.
At that level there will need to be some additional data munging since the
ranks below species seem to include the entire name, not just the species.
But this could be done from the lineage if all nodes are present and tagged
as such.
> What about node_name()? Yet another synonym of scientific_name? (right
> now it grabs the common name(s)). Ugh.
I agree things need cleaning up. You could always make node_name() an alias
for scientific_name() though it could just be deprecated.
> What should I do with the classification array? Should it hold the raw
> ScientificName like:
> join(',', $node->classification) eq 'Homo sapiens, Homo,
> Homo/Pan/Gorilla group [...]'?
> Or should it be like:
> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla
> group [...]'?
Don't know what the dump file gives; the XML output using efetch via entrez
has the raw lineage (as appears in a GenBank sequence file) and the actual
full lineage with TaxID, rank, 'scientific name,' in the actual lineage
order. I think one problem area will be the 'no rank' designations in the
lineage. Note that the below example also has a species and no genus;
tricky!
TaxId:          312284
ScientificName: marine actinobacterium PHSC20C1
Other names:    marine actinobacterium strain PHSC20C1
                marine actinobacterium str. PHSC20C1
ParentTaxId:    78537
Rank:           species
Division:       Bacteria
...
Lineage:        cellular organisms; Bacteria; Actinobacteria; Actinobacteria
                (class); unclassified Actinobacteria; unclassified Actinobacteria
                (miscellaneous)

TaxId    ScientificName                                Rank
131567   cellular organisms                            no rank
2        Bacteria                                      superkingdom
201174   Actinobacteria                                phylum
1760     Actinobacteria (class)                        class
52018    unclassified Actinobacteria                   no rank
78537    unclassified Actinobacteria (miscellaneous)   no rank
> The latter is how it currently works (when it works correctly); I would
> rather fix it than lose the logic completely, but if we're staying true
> to proper classification (vs. what a programmer might expect), I guess I
> must use the raw ScientificName?
>
> > binomial() isn't part of the NCBI taxonomy definition, so you have
> > freedom there to report what suits you.
>
> I don't think binomial() would serve any useful purpose now, however. I
> can either deprecate it or make it a synonym of scientific_name() or
> both. Or binomial() can be a version of scientific_name() that complains
> if you use it on a rank higher or lower than species. As for species()
> et al., it may have no place in a generic Node class. Thoughts?
The use of scientific_name() in this context would be more to conform with
what NCBI defines it as rather than as the actual definition; this should be
explicitly stated as such in POD and is more for long-term maintainability.
No matter what is done here, you will have some degree of confusion: those
who want strict adherence to the term 'scientific name' and those who want
the method to conform to NCBI's definition. Better to document the
reasoning for it in some way than risk the random masses complaining.
We could use binomial() for the 'scientific name' as the rest of the world
knows it (as in binomial nomenclature), having it built from genus-species
like you had originally suggested. That's what Hilmar suggested as an
'experimental' area of sorts, since NCBI doesn't use that particular term in
its taxonomy definition.
Chris
From cjfields at uiuc.edu Tue Jul 18 11:48:36 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 10:48:36 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <44BC5B1E.5080600@infotech.monash.edu.au>
Message-ID: <003301c6aa81$a34fd8e0$15327e82@pyrimidine>
Guess Dr. Conway became a Perl convert. The reviews of the book state that
the 'best practices' really come from his experience as a Perl programmer
over the last couple of decades, so maybe he learned something since 1990.
Chris
> > Ha ! I *almost* added something about that. I thought his vowel keys
> were
> > broken for a bit, maybe from pounding the keyboard with extreme
> frustration!
>
> The wide variety of pronunciation of English around the world can be
> mostly blamed on those damned vowels... so perhaps removing them helps
> one to reach a wider audience :-)
>
> > As an aside, doesn't Damian Conway say something about the non-use of
> vowels
> > in 'Perl Best Practices?' I think it was in relation to variables,
> > though...
>
> Yeah, on page 46 he says NOT to remove vowels in variable names, use
> prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff.
>
> (Actually, I studied at Monash University under Damian Conway, and
> recall his ridiculing of Perl, so I found it kind of ironic that he
> ended up changing the Perl landscape so significantly! He even wrote an
> internal publication "theStyle - a guide to C programming style" in
> about 1990 in which he violates some of his later Perl Best Practices :-)
>
> --
> Dr Torsten Seemann http://www.vicbioinformatics.com
> Victorian Bioinformatics Consortium, Monash University, Australia
From cjfields at uiuc.edu Tue Jul 18 12:05:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 11:05:48 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BC9F3F.2040500@sendu.me.uk>
Message-ID: <003401c6aa84$08ff6c80$15327e82@pyrimidine>
> I've just realised that there are currently no tests for
> Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped.
> Node doesn't get an especially thorough work-out either (in the skipped
> section).
>
> I'm guessing it's not feasible to include the full taxdump from NCBI
> (~40MB) in t/data... do people think it would be reasonable to create
> some sort of small subset of the data? I could just pull out the lines
> from names.dmp and nodes.dmp relevant to a few example organisms. Say,
> for human and a tricky bacteria and virus?
> For the purposes of running the test, where should the index files be
> kept? In t/data with the .dmp files or in /tmp? Should the test script
> delete them afterwards, or leave them be?
I would place a small section in t/data or several individual examples in a
subdirectory thereof (t/data/taxonomy).
> The entrez tests are skipped to 'avoid blocking', but the test only
> makes 2 entrez queries with a sleep(3) in-between. Basically, I don't
> think there's ever any reason to skip. Shall I remove the skip? Lots of
> other database-accessing tests in the test suite just go right ahead and
> access their database, no problem.
Depends on whether there is someone out there who doesn't have a network
connection (and there always is). The DB.t tests skip based on testing for
the env. variable BIOPERLDEBUG.
1..121
ok 1 # Skipping tests which require remote servers - set env variable
BIOPERLDEBUG to test
You could always do something along those lines or add a test for a network
connection using an eval block and skip the tests if the network test fails,
but there you run the risk of the tests failing not b/c of code problems but
from remote server issues; I've seen this happen with SwissProt and GenBank
testing before during peak hours.
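Something along these lines, as a sketch only: Test::More's SKIP block handles both the environment-variable check and a failed remote call. The remote_lookup() sub is a hypothetical stand-in for whatever actually queries the server; it is not a BioPerl function.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical stand-in for a call that talks to a remote server.
sub remote_lookup {
    my ($taxid) = @_;
    return { id => $taxid };
}

SKIP: {
    # Skip both tests unless the user has opted in to remote tests.
    skip 'remote tests need the BIOPERLDEBUG env variable set', 2
        unless $ENV{BIOPERLDEBUG};

    # Trap a died call so a server problem skips rather than fails.
    my $node = eval { remote_lookup(9606) };
    skip "remote server unavailable: $@", 2 if $@ || !$node;

    ok(defined $node, 'got a node back');
    is($node->{id}, 9606, 'node has the requested id');
}
```

The eval only guards against the call dying; a server that answers with garbage would still produce a real failure, which is probably what you want.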
Chris
> Cheers,
> Sendu.
From bix at sendu.me.uk Tue Jul 18 13:03:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 18:03:54 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003201c6aa81$01db9a30$15327e82@pyrimidine>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine>
Message-ID: <44BD147A.9020103@sendu.me.uk>
Chris Fields wrote:
>> What about the existing genus(), species(), sub_species() and variant()
>> methods? There would be no need for any logic to join things together,
>> but I would still like to be able to get just 'sapiens' from somewhere.
>> Can I use species() for that purpose (though again, species is strictly
>> 'Homo sapiens')? Likewise sub_species() and variant() could hold the
>> remaining non-redundant names. Or should all of these be deprecated
>> because they don't really have a place in a generic Node class?
>
> This is where Hilmar suggests that you have a bit of freedom in doing what
> you want, as with binomial(). So species() should return species
> ('sapiens'), genus return genus, etc.
[regarding changes to Bio::Taxonomy::Node]
Actually, I'm really strongly leaning toward getting rid of the
following methods and new() options (and giving up entirely on being
able to keep 'sapiens' somewhere):
-organelle, organelle()
-division, division()
-sub_species, sub_species()
-variant, variant()
species(), validate_species_name()
genus()
binomial()
As far as I can see none of these methods have any place in a generic
Node class. If you want to know what your species is you have to be
rank() 'species' and you just call scientific_name(). The above kind of
methods belong in something like Bio::Species or similar, NOT in Node.
Does anyone disagree? Can anyone offer a justification for keeping these
methods?
Changes I haven't yet discussed but have already made (but not committed):
*parent_taxon_id = \&parent_id;
*common_name = \&common_names;
-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
of a factory once set, and a factory seems redundant when we're a node
with a -dbh.
validate_name() removed because it just returns 1.
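For anyone not used to the typeglob syntax, here is a self-contained sketch of what the two alias lines do. My::Node is a hypothetical stand-in with trivial accessors, not the real Bio::Taxonomy::Node.

```perl
package My::Node;   # hypothetical stand-in class
use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Simple get/set accessor.
sub parent_id {
    my $self = shift;
    $self->{parent_id} = shift if @_;
    return $self->{parent_id};
}

# Accumulating accessor: adds any names given, returns the full list.
sub common_names {
    my $self = shift;
    push @{ $self->{common_names} }, @_ if @_;
    return @{ $self->{common_names} || [] };
}

# Typeglob assignment installs a second name for the same code ref.
{
    no warnings 'once';
    *parent_taxon_id = \&parent_id;
    *common_name     = \&common_names;
}

package main;

my $node = My::Node->new;
$node->parent_taxon_id(9605);          # set through the alias
$node->common_name('human', 'man');
print $node->parent_id, "\n";          # read through the original name
print join(', ', $node->common_names), "\n";
```

Because both names point at the same sub, there is no second implementation to keep in sync, which is the attraction over writing a wrapper method.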
>> What about node_name()? Yet another synonym of scientific_name? (right
>> now it grabs the common name(s)). Ugh.
>
> I agree things need cleaning up. You could always make node_name() an alias
> for scientific_name() though it could just be deprecated.
Actually, I've gone with node_name as the 'pure' and best method to set
the name of your node with, and made scientific_name an alias of it
(though it behaves as suggested earlier in the thread).
>> What should I do with the classification array? Should it hold the raw
>> ScientificName like:
>> join(',', $node->classification) eq 'Homo sapiens, Homo,
>> Homo/Pan/Gorilla group [...]'?
(I've decided to do it the above way for consistency with scientific_name)
>> Or should it be like:
>> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla
>> group [...]'?
>
> Don't know what the dump file gives; the XML output using efetch via entrez
> has the raw lineage (as appears in a GenBank sequence file) and the actual
> full lineage with TaxID, rank, 'scientific name,' in the actual lineage
> order. I think one problem area will be the 'no rank' designations in the
> lineage. Note that the below example also has a species and no genus;
> tricky!
Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
when they build the classification array. I had no intention of changing
this behaviour.
> 1760
> Actinobacteria (class)
> class
Ugh. I guess my proposal to remove <> bits via flatfile extends to
removing () bits via entrez. We don't need unique names; we can use
object_id() when uniqueness matters.
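A minimal sketch of that stripping, assuming the qualifier is always a single trailing <...> or (...) group. The sub name is made up, and the '<example>' qualifier is invented because the real ones did not survive in this archive.

```perl
use strict;
use warnings;

# Remove one trailing "<...>" or "(...)" qualifier, plus surrounding
# whitespace, and leave everything else alone.
sub strip_qualifier {
    my ($name) = @_;
    $name =~ s/\s*(?:<[^<>]*>|\([^()]*\))\s*\z//;
    return $name;
}

for my $raw ('Actinobacteria (class)', 'Rhodotorula <example>', 'Homo sapiens') {
    printf "%-24s -> %s\n", $raw, strip_qualifier($raw);
}
```

One caveat with this approach: a name that legitimately ends in parentheses, such as 'unclassified Actinobacteria (miscellaneous)', gets trimmed too, so it may need a whitelist of qualifiers rather than a blanket pattern.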
>> I don't think binomial() would serve any useful purpose now, however.
>
> We could use binomial() for the 'scientific name' as the rest of the world
> knows it (as in binomial nomenclature), having it built from genus-species
> like you had originally suggested.
No, see above. I don't think it makes the slightest bit of sense for a
Node to go around trying to build things from a parent it may or may not
have. Again, binomial() is a method for something like Bio::Species, not
a generic Node class.
From cjfields at uiuc.edu Tue Jul 18 15:34:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 14:34:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD147A.9020103@sendu.me.uk>
Message-ID: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
...
> [regarding changes to Bio::Taxonomy::Node]
>
> Actually, I'm really strongly leaning toward getting rid of the
> following methods and new() options (and giving up entirely on being
> able to keep 'sapiens' somewhere):
>
> -organelle, organelle()
> -division, division()
> -sub_species, sub_species()
> -variant, variant()
> species(), validate_species_name()
> genus()
> binomial()
>
> As far as I can see none of these methods have any place in a generic
> Node class. If you want to know what your species is you have to be
> rank() 'species' and you just call scientific_name(). The above kind of
> methods belong in something like Bio::Species or similar, NOT in Node.
> Does anyone disagree? Can anyone offer a justification for keeping these
> methods?
Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes
to Node will affect Bio::Species to some degree.
If you can get the lineage from XML, you could set many of these based on
the rank given. Jason uses XML::Twig in Bio::DB::Taxonomy::entrez to parse
out the XML data into Bio::Taxonomy::Node objects; it shouldn't be difficult
to leave some methods based on rank (genus, species, etc) as simple get/set
methods for the time being and leave the heavy lifting to the modules
dealing directly with the data.
Bio::Species could then delegate data/methods over to Bio::Taxonomy::Node
fairly easily. If there is no genus/species data to be grabbed (either it
doesn't exist or isn't present for some reason), then simply leave it as
undef.
That's also why I thought binomial() could stick around; if you have both
the genus() and species() you could grab both using binomial(), building in
special cases or error handling in case genus() or species() or both return
undef. I don't see the problem in keeping this as long as users know what
it means: by detailing the method in POD. If someone complains we tell them
to RTFM.
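Roughly what I have in mind, as a sketch: My::Species is a hypothetical stand-in with plain get/set methods, and returning undef when either part is missing is just one possible way to handle that case.

```perl
package My::Species;   # hypothetical stand-in class
use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Plain get/set methods; the parser would fill these in from the ranks.
sub genus {
    my $self = shift;
    $self->{genus} = shift if @_;
    return $self->{genus};
}

sub species {
    my $self = shift;
    $self->{species} = shift if @_;
    return $self->{species};
}

# Join genus and species; return undef if either one is missing.
sub binomial {
    my $self = shift;
    my ($genus, $species) = ($self->genus, $self->species);
    return unless defined $genus && defined $species;
    return "$genus $species";
}

package main;

my $human   = My::Species->new(genus => 'Homo', species => 'sapiens');
my $nogenus = My::Species->new(species => 'marine actinobacterium PHSC20C1');

print $human->binomial, "\n";
print defined $nogenus->binomial ? 'has binomial' : 'no binomial', "\n";
```

Whether the missing-genus case should return undef, warn, or fall back to the scientific name is exactly the kind of thing the POD would need to spell out.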
> Changes I haven't yet discussed but have already made (but not committed):
>
> *parent_taxon_id = \&parent_id;
> *common_name = \&common_names;
> -factory and factory() removed, since there is no
> Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
> of a factory once set, and a factory seems redundant when we're a node
> with a -dbh.
> validate_name() removed because it just returns 1.
>
...
> Actually, I've gone with node_name as the 'pure' and best method to set
> the name of your node with, and made scientific_name an alias of it
> (though it behaves as suggested earlier in the thread).
I don't have any problem with that. As long as it conforms somewhat to the
NCBI definition to prevent confusion I think it's okay.
> >> What should I do with the classification array? Should it hold the raw
> >> ScientificName like:
> >> join(',', $node->classification) eq 'Homo sapiens, Homo,
> >> Homo/Pan/Gorilla group [...]'?
>
> (I've decided to do it the above way for consistency with scientific_name)
I think that's fine.
...
> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
> when they build the classification array. I had no intention of changing
> this behaviour.
If you ignore nodes with 'no rank' there will be major problems when
retrieving certain TaxID's from protein/nucleotide sequences. I had posted
some sample XML from many NCBI TaxIDs taken from sequence files and via
ELink and a good many of those nodes (most of them from genome projects)
have 'no rank'.
376686
Flavobacterium johnsoniae UW101
...
986
no rank
...
373903
Halothermothrix orenii H 168
...
31909
no rank
These aren't 'edge cases' anymore but now are pretty common from genome
sequencing. I would just assign 'no rank' to rank() and have the node
retained for DB purposes.
It seems that the tax dump loses quite a bit of information somewhere along
the way that shows up in the XML. Or am I wrong?
> > 1760
> > Actinobacteria (class)
> > class
>
> Ugh. I guess my proposal to remove <> bits via flatfile extends to
> removing () bits via entrez. We don't need unique names; we can use
> object_id() when uniqueness matters.
The XML parsing in Taxonomy::entrez will take care of the and retains
the character data in between. It would be a matter of setting the parser
correctly to grab the relevant data and assign it properly.
> >> I don't think binomial() would serve any useful purpose now, however.
> >
> > We could use binomial() for the 'scientific name' as the rest of the
> world
> > knows it (as in binomial nomenclature), having it built from genus-
> species
> > like you had originally suggested.
>
> No, see above. I don't think it makes the slightest bit of sense for a
> Node to go around trying to build things from a parent it may or may not
> have. Again, binomial() is a method for something like Bio::Species, not
> a generic Node class.
Bio::Species, from what I gather, was initially created to hold the tax data
from GenBank/EMBL/SwissProt (RichSeq) files and is not DB-aware.
Bio::Taxonomy::Node was supposed to be like Bio::Species and also be
DB-aware:
http://thread.gmane.org/gmane.comp.lang.perl.bio.general/4284/focus=4321
Again, Bio::Species methods are supposed to (eventually) delegate to
Bio::Taxonomy::Node, so the two are closely linked along with their methods.
Any way we go about it here (keeping certain methods and tossing others,
changing the data returned, etc), it looks like there will be API issues
down the road which will directly affect anyone using tax data. That
affects bioperl-db directly as well as any other bioperl-based DB's which
rely on tax data. So we need to tread a bit carefully when making major
changes to make sure that they work for bioperl-db and anywhere else that
may require it.
Chris
From cjfields at uiuc.edu Tue Jul 18 15:41:31 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 14:41:31 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD147A.9020103@sendu.me.uk>
Message-ID: <000a01c6aaa2$2b4f50c0$15327e82@pyrimidine>
Sendu et al,
I'll play around with adding a quick method to Bio::Species for
scientific_name(); if I can get it to play nice with Bio::SeqIO::genbank and
it passes tests I'll commit it.
Chris
From golharam at umdnj.edu Tue Jul 18 15:36:54 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Tue, 18 Jul 2006 15:36:54 -0400
Subject: [Bioperl-l] advice
In-Reply-To: <003301c6aa81$a34fd8e0$15327e82@pyrimidine>
Message-ID: <00a501c6aaa1$86edb620$2f01a8c0@GOLHARMOBILE1>
Right. There was a chain letter going around the internet for awhile
about how you can leave out certain letters and the human brain will
still be able to correctly interpret what the word is supposed to be.
Either that or it was something about how Europe was adopting a new
variation of English and after many successions it started to sound/look
like German.
> The wide variety of pronunciation of English around the world can be
> mostly blamed on those damned vowels... so perhaps removing them helps
> one to reach a wider audience :-)
>
> > As an aside, doesn't Damian Conway say something about the non-use
> > of
> vowels
> > in 'Perl Best Practices?' I think it was in relation to variables,
> > though...
>
> Yeah, on page 46 he says NOT to remove vowels in variable names, use
> prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff.
From cjfields at uiuc.edu Tue Jul 18 17:44:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 16:44:29 -0500
Subject: [Bioperl-l] Bio::SeqIO::genbank and Bio::Species
Message-ID: <000001c6aab3$58ee7bd0$15327e82@pyrimidine>
For a given GenBank file, you'll have the following (this is from NCBI's
current flatfile format,
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
KEYWORDS .
SOURCE Saccharomyces cerevisiae (baker's yeast)
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...
The SOURCE line above, according to NCBI, contains an abbreviated name and a
common name (optional); it can also apparently contain additional
information, such as organelles and so on. The ORGANISM line contains
NCBI's definition of the formal scientific name (see the related thread on
Taxonomy proposed changes) along with lineage information.
Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with
bacterial names, so when I process everything through SeqIO I get:
SOURCE Mycobacterium tuberculosis H37Rv H37Rv
ORGANISM Mycobacterium tuberculosis
SOURCE Mycobacterium tuberculosis CDC1551 CDC1551
ORGANISM Mycobacterium tuberculosis
SOURCE Mycobacterium avium subsp. paratuberculosis K-10
paratuberculosis K-10
ORGANISM Mycobacterium avium subsp.
SOURCE Bacillus sp. NRRL B-14911 NRRL B-14911
ORGANISM Bacillus sp.
I have added a scientific_name() method to Bio::Species to contain the
string on the ORGANISM line and replace it as is, which seems to work well
(doesn't chop the name down). The bigger issue is the mess with the SOURCE
line. This stems from adding back information from sub_species(), which I
don't think needs to be done as it's supposed to be an abbreviated name.
Anybody mind if I try splitting up the original SOURCE line data into
organelle(), abbreviated_name(), and common_name()? This will change
common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give
'baker's yeast') but will also conform more to the NCBI definition of
'common name.' Also, organelle info isn't handled yet; I could toy with
adding support for it. Any objections?
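
A rough sketch of how the SOURCE line text might be split, assuming the layout "[organelle] abbreviated name [(common name)]". The helper name parse_source and the short organelle list are illustrative assumptions, not existing Bio::SeqIO::genbank code or NCBI's full vocabulary.

```perl
use strict;
use warnings;

# Illustrative organelle prefixes only; the real list is an open question.
my $ORGANELLE = qr/(?:mitochondrion|chloroplast|plastid)/;

# Hypothetical helper: split the text of a GenBank SOURCE line into parts.
sub parse_source {
    my ($source) = @_;
    my %parts;
    # Optional leading organelle, e.g. "mitochondrion Homo sapiens (human)".
    $parts{organelle} = $1 if $source =~ s/^($ORGANELLE)\s+//;
    # Optional trailing common name in parentheses.
    $parts{common_name} = $1 if $source =~ s/\s*\(([^()]+)\)\s*$//;
    # Whatever remains is treated as the abbreviated name.
    $parts{abbreviated_name} = $source;
    return \%parts;
}

my $yeast = parse_source("Saccharomyces cerevisiae (baker's yeast)");
print "$yeast->{abbreviated_name} | $yeast->{common_name}\n";
```

Keys for absent parts are simply not set, so callers would see undef for a missing organelle or common name.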
I may proceed to do the same with EMBL, SwissProt, and others that use
Bio::Species if this works out.
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Tue Jul 18 18:50:37 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 23:50:37 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
Message-ID: <44BD65BD.4030501@sendu.me.uk>
Chris Fields wrote:
> ...
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
>
> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes
> to Node will affect Bio::Species to some degree.
I see from the original postings that Node was intended to be like
Species, but I don't think it makes the slightest bit of sense. A
/single/ Node need only (must only!) represent the information for a
single node in the taxonomy. Or else what do these objects mean? What is
the object model? It's bad bad bad for it to be sensible one way (when
you're making your own taxonomy by making your own nodes) and
nonsensical another (when we stuff in methods so that Bio::Species is
happy). The way Node is written right now, and what you're suggesting,
is that we stuff the entire Taxonomy into the Node. Well, except that
you don't even have methods for every taxonomic level - there is genus()
but no subphylum(). I can't emphasise strongly enough how insane all
this is.
The correct thing for Bio::Species to interact with is Bio::Taxonomy.
Bio::Taxonomy is a collection of Nodes and has the sort of methods that
Bio::Species would need to delegate its current functionality.
I'm quite willing to do a proper overhaul here so everything makes
sense. You either make your own nodes and add these to a Taxonomy or use
a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy
lets you discover the classification of any node it contains.
Bio::Species could implement a method like genus() by:
$node = $taxonomy->get_node('genus') || return;
return $node->scientific_name;
Bio::Taxonomy isn't perfect, but I can certainly get it to do its job.
I'd probably make it rank-name and order independent for starters.
Bio::Taxonomy::Node needs to be reduced right down to just hold data
about the node it represents, and possibly its parent node id (or other
way of getting to its parent). So now I'm proposing dropping the
classification() method from Node as well. It's simply not necessary;
Bio::Taxonomy should give you that information.
Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from
its docs, but it could be used to build a Taxonomy (that seems to be its
intent, I'm just not sure what some of the methods are really supposed
to do) such that Node might not even need any methods for getting its
parent or child nodes. The Factory or Taxonomy might be able to deal
with that.
In short, I'm proposing a major change to Bio::Taxonomy::Node (make it
just a node), and minor changes to (& implementation of) Bio::Taxonomy
and Bio::Taxonomy::FactoryI such that they actually get used to do their
jobs.
> That's also why I thought binomial() could stick around; if you have both
> the genus() and species() you could grab both using binomial(), building in
> special cases or error handling in case genus() or species() or both return
> undef.
binomial() would belong in (and is present in) Bio::Taxonomy. But in any
case, it's not needed there either; if you want the binomial you just
ask for the scientific_name of the species node in your Taxonomy, since
this now contains the actual scientific name == binomial.
binomial() in Bio::Taxonomy could be reimplemented as:
$node = $self->get_node('species') || return;
return $node->scientific_name;
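
The delegation idea above can be sketched with toy classes. My::Node and My::Taxonomy are hypothetical stand-ins that hold only what the example needs; they are not the real Bio::Taxonomy or Bio::Taxonomy::Node, and get_node() taking a rank name is assumed from the snippets in this message.

```perl
use strict;
use warnings;

# Toy stand-ins for illustration; not the real Bio::Taxonomy classes.
package My::Node;
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub rank            { $_[0]{rank} }
sub scientific_name { $_[0]{scientific_name} }

package My::Taxonomy;
sub new { my $class = shift; return bless { nodes => [] }, $class }
sub add_node { my ($self, $node) = @_; push @{ $self->{nodes} }, $node; return $self }

# Return the first node with the requested rank, or nothing if absent.
sub get_node {
    my ($self, $rank) = @_;
    for my $node (@{ $self->{nodes} }) {
        return $node if defined $node->rank && $node->rank eq $rank;
    }
    return;
}

# binomial() simply asks the species node for its scientific name.
sub binomial {
    my $self = shift;
    my $node = $self->get_node('species') || return;
    return $node->scientific_name;
}

package main;

my $taxonomy = My::Taxonomy->new;
$taxonomy->add_node(My::Node->new(rank => 'genus',   scientific_name => 'Homo'));
$taxonomy->add_node(My::Node->new(rank => 'species', scientific_name => 'Homo sapiens'));
print $taxonomy->binomial, "\n";
```

Each node here knows only its own rank and name; anything that spans nodes lives in the collection.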
>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
>> when they build the classification array. I had no intention of changing
>> this behaviour.
>
> If you ignore nodes with 'no rank' there will be major problems when
> retrieving certain TaxID's from protein/nucleotide sequences.
This is only for the classification array, which is meaningless anyway
(there only for file-format compatibility). If you want the real
information you ask your Bio::Taxonomy (which asks each of its nodes).
This is the whole point of having Bio::Taxonomy in the first place.
It gives you great flexibility to do whatever you want to do.
>>> 1760
>>> Actinobacteria (class)
>>> class
>> Ugh. I guess my proposal to remove <> bits via flatfile extends to
>> removing () bits via entrez. We don't need unique names; we can use
>> object_id() when uniqueness matters.
>
> The XML parsing in Taxonomy::entrez will take care of the and retains
> the character data in between.
You misunderstood. I meant the <> bits I discussed at the very start of
this thread, that flatfile gives you. Here I'm referring to getting rid
of ' (class)' as well.
> Any way we go about it here (keeping certain methods and tossing others,
> changing the data returned, etc), it looks like there will be API issues
> down the road which will directly affect anyone using tax data. That
> affects bioperl-db directly as well as any other bioperl-based DB's which
> rely on tax data. So we need to tread a bit carefully when making major
> changes to make sure that they work for bioperl-db and anywhere else that
> may require it.
Does anything make serious use of the current Bio::Taxonomy code? Or are
they using Bio::Species?
From cjfields at uiuc.edu Wed Jul 19 00:38:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 23:38:05 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD65BD.4030501@sendu.me.uk>
References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
<44BD65BD.4030501@sendu.me.uk>
Message-ID:
I think we should wait a bit for any dramatic changes but implement
the ones there seems to be a consensus on. I understand your
reasoning for taking this on but I'm not sure completely revamping
Bio::Taxonomy w/o input from the core developers is wise, especially
since we do NOT know who uses it, why they use it, and how changing/
removing methods will affect their code. We are doing nothing
productive here by constantly butting heads on this and having
different opinions on what we think Bio::Taxonomy/Bio::Species is
best suited for, when neither one of us is actually sure about who
uses it and why. A reasonable solution is there but we must rely on
outside opinions in order to reach it, so I propose a short
moratorium on changes to Bio::Taxonomy/Bio::Species that radically
redefine the API on either class. BTW, for anybody following, I'm
perfectly comfortable if Sendu takes the lead on this and implements
his changes; I'm just not sure about stripping the class down to the
bare minimum.
So far, the only thing that has been proposed (and accepted by all)
is that scientific_name() hold the data for that tag in a node. I
think most here would agree that's fine; I've already added a get/set
to Bio::Species but haven't committed it yet. However, what you
propose doing below is refactoring the code and changing the API. I
agree there needs to be an overhaul but we can't do this w/o guidance
or input from the GBE (Great Bioperl Elders). I would like some of
the 'senior' core developers to chime in a bit more with their thoughts on
this. Jason also mentioned somewhere that any changes for Taxonomy/
Species should be tracked on the wiki somewhere as well to make sure
everything is kosher and keep users up-to-date. I would like his
input here but I think he's still incommunicado at the moment.
Chris
On Jul 18, 2006, at 5:50 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> ...
>>> [regarding changes to Bio::Taxonomy::Node]
>>>
>>> Actually, I'm really strongly leaning toward getting rid of the
>>> following methods and new() options (and giving up entirely on being
>>> able to keep 'sapiens' somewhere):
>>>
>>> -organelle, organelle()
>>> -division, division()
>>> -sub_species, sub_species()
>>> -variant, variant()
>>> species(), validate_species_name()
>>> genus()
>>> binomial()
>>
>> Bio::Species and Bio::Taxonomy::Node are closely linked and plans
>> are to
>> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any
>> changes
>> to Node will affect Bio::Species to some degree.
>
> I see from the original postings that Node was intended to be like
> Species, but I don't think it makes the slightest bit of sense. A
> /single/ Node need only (must only!) represent the information for a
> single node in the taxonomy. Or else what do these objects mean?
> What is
> the object model? It's bad bad bad for it to be sensible one way (when
> you're making your own taxonomy by making your own nodes) and
> nonsensical another (when we stuff in methods so that Bio::Species is
> happy). The way Node is written right now, and what you're suggesting,
> is that we stuff the entire Taxonomy into the Node. Well, except that
> you don't even have methods for every taxonomic level - there is
> genus()
> but no subphylum(). I can't emphasise strongly enough how insane all
> this is.
>
> The correct thing for Bio::Species to interact with is Bio::Taxonomy.
> Bio::Taxonomy is a collection of Nodes and has the sort of methods
> that
> Bio::Species would need to delegate its current functionality.
>
> I'm quite willing to do a proper overhaul here so everything makes
> sense. You either make your own nodes and add these to a Taxonomy
> or use
> a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy
> lets you discover the classification of any node it contains.
> Bio::Species could implement a method like genus() by:
> $node = $taxonomy->get_node('genus') || return;
> return $node->scientific_name;
>
> Bio::Taxonomy isn't perfect, but I can certainly get it to do its job.
> I'd probably make it rank-name and order independent for starters.
>
> Bio::Taxonomy::Node needs to be reduced right down to just hold data
> about the node it represents, and possibly its parent node id (or
> other
> way of getting to its parent). So now I'm proposing dropping the
> classification() method from Node as well. It's simply not necessary;
> Bio::Taxonomy should give you that information.
>
> Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment
> from
> its docs, but it could be used to build a Taxonomy (that seems to
> be its
> intent, I'm just not sure what some of the methods are really supposed
> to do) such that Node might not even need any methods for getting its
> parent or child nodes. The Factory or Taxonomy might be able to deal
> with that.
>
> In short, I'm proposing a major change to Bio::Taxonomy::Node (make it
> just a node), and minor changes to (& implementation of) Bio::Taxonomy
> and Bio::Taxonomy::FactoryI such that they actually get used to do
> their
> jobs.
>
>
>> That's also why I thought binomial() could stick around; if you
>> have both
>> the genus() and species() you could grab both using binomial(),
>> building in
>> special cases or error handling in case genus() or species() or
>> both return
>> undef.
>
> binomial() would belong in (and is present in) Bio::Taxonomy. But
> in any
> case, it's not needed there either; if you want the binomial you just
> ask for the scientific_name of the species node in your Taxonomy,
> since
> this now contains the actual scientific name == binomial.
>
> binomial() in Bio::Taxonomy could be reimplemented as:
> $node = $self->get_node('species') || return;
> return $node->scientific_name;
>
>
>>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
>>> when they build the classification array. I had no intention of
>>> changing
>>> this behaviour.
>>
>> If you ignore nodes with 'no rank' there will be major problems when
>> retrieving certain TaxID's from protein/nucleotide sequences.
>
> This is only for the classification array, which is meaningless anyway
> (there only for file-format compatibility). If you want the real
> information you ask your Bio::Taxonomy (which asks each of its nodes).
> This is the whole point of having Bio::Taxonomy in the first place.
>
> It gives you great flexibility to do whatever you want to do.
>
>
>>>> 1760
>>>> Actinobacteria (class)
>>>> class
>>> Ugh. I guess my proposal to remove <> bits via flatfile extends to
>>> removing () bits via entrez. We don't need unique names; we can use
>>> object_id() when uniqueness matters.
>>
>> The XML parsing in Taxonomy::entrez will take care of the
>> and retains
>> the character data in between.
>
> You misunderstood. I meant the <> bits I discussed at the very
> start of
> this thread, that flatfile gives you. Here I'm referring to getting
> rid
> of ' (class)' as well.
>
>
>> Any way we go about it here (keeping certain methods and tossing
>> others,
>> changing the data returned, etc), it looks like there will be API
>> issues
>> down the road which will directly affect anyone using tax data. That
>> affects bioperl-db directly as well as any other bioperl-based
>> DB's which
>> rely on tax data. So we need to tread a bit carefully when making
>> major
>> changes to make sure that they work for bioperl-db and anywhere
>> else that
>> may require it.
>
> Does anything make serious use of the current Bio::Taxonomy code?
> Or are
> they using Bio::Species?
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From ong at embl.de Wed Jul 19 03:51:48 2006
From: ong at embl.de (ong at embl.de)
Date: Wed, 19 Jul 2006 09:51:48 +0200
Subject: [Bioperl-l] Fwd: Re: BioPerl query
Message-ID: <20060719095148.f71b1v3p7qosk440@webmail.embl.de>
HI, Anybody have an answer to the below query? Thanks.
Regards,
Ong
----- Forwarded message from birney at ebi.ac.uk -----
Date: Wed, 19 Jul 2006 08:16:06 +0100
From: Ewan Birney
Reply-To: Ewan Birney
Subject: Re: BioPerl query
To: ong at embl.de
On 18 Jul 2006, at 10:26, ong at embl.de wrote:
> Dear Birney,
>
> Good day i wish to get your advise on how do i print out the PSM
> matrix from
> the code below. Thanks
>
I would ask this message on the bioperl list, not to me directly.
> Regards,
> Ong
>
> use Bio::Matrix::PSM::IO;
>
> my $psmIO=new Bio::Matrix::PSM::IO(-file=>'matrix.dat',
> -format=>'transfac');
> while (my $psm=$psmIO->next_psm) {
> my $id=$psm->id;
> my $an=$psm->accession_number;
> my $re = $psm->regexp;
> #my $l=$psm->width;
> my $cons=$psm->IUPAC;
> # $l is left out of the output because its assignment above is commented out
> print "$id\t$an\t$re\t$cons\t$psm\n";
> }
----- End forwarded message -----
From rmb32 at cornell.edu Tue Jul 18 20:06:02 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Tue, 18 Jul 2006 17:06:02 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push,
and things get complicated
Message-ID: <44BD776A.1080402@cornell.edu>
Hi all,
Here's a kind of abstract question about Bioperl and XML parsing:
I'm thinking about writing a bioperl parser for genomethreader XML, and
I'm sort of mulling over the 'impedance mismatch' between the way
bioperl Bio::*IO::* modules work and the way all of the current XML
parsers work. Bioperl uses a 'pull' model, where every time you want a
new chunk of stuff, you call $io_object->next_thing. All the XML
parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
'push' model, where every time they parse a chunk, they call _your_
code, usually via a subroutine reference you've given to the XML parser
when you start it up.
From what I can tell, current Bioperl IO modules that parse XML are
using push parsers to parse the whole document, holding stuff in memory,
then spoon-feeding it in chunks to the calling program when it calls
next_*(). This is fine until the input XML gets really big, in which
case you can quickly run out of memory.
Does anybody have good ideas for nice, robust ways of writing a bioperl
IO module for really big input XML files? There don't seem to be any
perl pull parsers for XML. All I've dug up so far would be having the
XML push parser running in a different thread or process, pushing chunks
of data into a pipe or similar structure that blocks the progress of the
push parser until the pulling bioperl code wants the next piece of data,
but there are plenty of ugly issues with that, whether one were to use
perl threads for it (aaagh!) or fork and push some kind of intermediate
format through a pipe or socket between the two processes (eek!).
So, um, if you've read this far, do you have any ideas?
Rob
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu
From alc at sanger.ac.uk Wed Jul 19 06:55:12 2006
From: alc at sanger.ac.uk (Avril Coghlan)
Date: Wed, 19 Jul 2006 11:55:12 +0100
Subject: [Bioperl-l] parsing est2genome output
Message-ID: <1153306513.27383.12.camel@deskpro104.dynamic.sanger.ac.uk>
An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060719/67f858ce/attachment.pl
From bernd.web at gmail.com Wed Jul 19 07:36:08 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Wed, 19 Jul 2006 13:36:08 +0200
Subject: [Bioperl-l] SearchIO HOWTO
Message-ID: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com>
Hi,
On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO
parse your BLAST report.
In the Table of methods, the third line from the bottom is:
"HSP alignment Not available in this report Bio::SimpleAlign object "
Would it not be good to add the get_aln method ( $hsp->get_aln) ?
The line in "Using the methods"
my $alignment_as_string = $alnIO->write_aln($aln);
may be confusing: $alignment_as_string will be "1" on success and the
alignment is printed to STDOUT. Should IO::String be introduced here
to set up a string filehandle?
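
A small sketch of the point being made, using core Perl's in-memory filehandle as a stand-in for IO::String. The write_report() sub is hypothetical and only mimics a writer that prints to a handle and returns 1; it is not Bio::AlignIO.

```perl
use strict;
use warnings;

# Stand-in for a writer such as write_aln(): it prints to a filehandle and
# returns 1 on success, so its return value is not the text itself.
sub write_report {
    my ($fh, @lines) = @_;
    print {$fh} "$_\n" for @lines;
    return 1;
}

# Core Perl in-memory filehandle; IO::String offers a similar interface.
my $captured = '';
open my $fh, '>', \$captured or die "cannot open in-memory file: $!";
my $status = write_report($fh, 'seq1 ACGT', 'seq2 AC-T');
close $fh;

print "status=$status\n";
print $captured;
```

With BioPerl itself, the same kind of handle could presumably be passed to Bio::AlignIO->new through its -fh argument, so that write_aln() fills the string instead of printing.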
Best regards,
Bernd
From hlapp at gmx.net Wed Jul 19 09:40:47 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 19 Jul 2006 09:40:47 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push,
and things get complicated
In-Reply-To: <44BD776A.1080402@cornell.edu>
References: <44BD776A.1080402@cornell.edu>
Message-ID: <73755CCF-2966-4580-BBEF-1F8A94CDC55D@gmx.net>
In the past the way this was done for potentially big XML files is to
use regex-based extraction of chunks that correspond to an object you
want to return per call to next_XXX(). That chunk would then be
passed on to the XML parser under the hood.
This only gets problematic once even the chunks are huge, or the name
of the element that encloses your chunk can be ambiguous with what's
in your text. The latter is unlikely though if you include the angle
brackets.
I believe this is how at least some bioperl parsers for XML-based
formats were written, and it seemed to work fine.
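
A minimal sketch of the chunking approach described above, assuming a non-nested enclosing element. The helper name next_chunk and the <hit> element are made up for illustration; a real module would pass each chunk to an XML parser instead of printing it.

```perl
use strict;
use warnings;

# Hypothetical pull-style reader: each call returns the next
# <element>...</element> chunk from the filehandle, or nothing at the end.
sub next_chunk {
    my ($fh, $element) = @_;
    local $/ = "</$element>";   # read up to and including the closing tag
    my ($open, $close) = (quotemeta("<$element"), quotemeta("</$element>"));
    while (defined(my $buffer = <$fh>)) {
        # Keep only the part from the opening tag onwards.
        return $1 if $buffer =~ m{(${open}(?:\s|>).*${close})}s;
    }
    return;
}

my $xml = "<report><hit id='1'>a</hit>\n<hit id='2'>b</hit></report>";
open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
while (my $chunk = next_chunk($fh, 'hit')) {
    # A real module would hand $chunk to an XML parser here.
    print "$chunk\n";
}
close $fh;
```

Only one chunk is held in memory at a time, which is the property that matters for very large files. Including the angle brackets in the record separator is what keeps the element name from colliding with ordinary text.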
-hilmar
On Jul 18, 2006, at 8:06 PM, Robert Buels wrote:
> Hi all,
>
> Here's a kind of abstract question about Bioperl and XML parsing:
>
> I'm thinking about writing a bioperl parser for genomethreader XML,
> and
> I'm sort of mulling over the 'impedance mismatch' between the way
> bioperl Bio::*IO::* modules work and the way all of the current XML
> parsers work. Bioperl uses a 'pull' model, where every time you
> want a
> new chunk of stuff, you call $io_object->next_thing. All the XML
> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> 'push' model, where every time they parse a chunk, they call _your_
> code, usually via a subroutine reference you've given to the XML
> parser
> when you start it up.
>
> From what I can tell, current Bioperl IO modules that parse XML are
> using push parsers to parse the whole document, holding stuff in
> memory,
> then spoon-feeding it in chunks to the calling program when it calls
> next_*(). This is fine until the input XML gets really big, in which
> case you can quickly run out of memory.
>
> Does anybody have good ideas for nice, robust ways of writing a
> bioperl
> IO module for really big input XML files? There don't seem to be any
> perl pull parsers for XML. All I've dug up so far would be having the
> XML push parser running in a different thread or process, pushing
> chunks
> of data into a pipe or similar structure that blocks the progress
> of the
> push parser until the pulling bioperl code wants the next piece of
> data,
> but there are plenty of ugly issues with that, whether one were to
> use
> perl threads for it (aaagh!) or fork and push some kind of
> intermediate
> format through a pipe or socket between the two processes (eek!).
>
> So, um, if you've read this far, do you have any ideas?
>
> Rob
>
> --
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY 14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu
>
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From jay at jays.net Wed Jul 19 09:43:52 2006
From: jay at jays.net (Jay Hannah)
Date: Wed, 19 Jul 2006 08:43:52 -0500 (CDT)
Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db
Message-ID:
Howdy --
I'm using bioperl-db + biosql-schema + mySQL.
I can now successfully build a biosql-schema instance in mySQL, load
taxonomy, then using bioperl-db load a GenBank file from disk, commiting
taxonomy, then using bioperl-db load a GenBank file from disk, committing
I can tell bioperl-db to delete that from mySQL and it does. Yay!! I'll be
throwing a "Using bioperl-db" document onto the wiki over the next week.
What I am currently baffled by:
How do I ask bioperl-db to walk over multiple bioentries in my database so
I can do things with them? The simplest possible example: print a list of
all bioentries in my database.
It is trivially easy to just query mySQL directly, but if I'm reading /
understanding the documentation correctly bioperl-db intends to be
database schema and RDBMS agnostic. In that case, I should use bioperl-db
to walk my records. So, how do I do that?
Is Bio::DB::Query::BioQuery the way to do this? The only way?
If so then can someone help me understand the datacollections() and
where() methods?
perldoc Bio::DB::Query::BioQuery
# all mouse sequences loaded under namespace ensembl that
# have receptor in their description
$query->datacollections(["Bio::PrimarySeqI e",
"Bio::Species=>Bio::PrimarySeqI sp",
"BioNamespace=>Bio::PrimarySeqI db"]);
$query->where(["sp.binomial like 'Mus *'",
"e.desc like '*receptor*'",
"db.namespace = 'ensembl'"]);
# all mouse sequences loaded under namespace ensembl that
# have receptor in their description, and that also have a
# cross-reference with SWISS as the database
$query->datacollections(["Bio::PrimarySeqI e",
"Bio::Species=>Bio::PrimarySeqI sp",
"BioNamespace=>Bio::PrimarySeqI db",
"Bio::Annotation::DBLink xref",
I'm bewildered by this API. Please forgive my ignorance.
1) How do I get *all* bioentries out of my database?
2) Say I did want just the "namespace" 'Pico' (one of my
biodatabase.name's). Where did
"BioNamespace=>Bio::PrimarySeqI db"]);
come from? How was I supposed to figure out the left hand side of that
mapping? The right hand side? If that line wasn't sitting in that document
was there a way for me to figure it out as a *user* of bioperl-db? Or
would I need to be a *programmer* of bioperl-db reading source to figure
this out? Where did
"db.namespace = 'ensembl'"]);
come from? Again, do I have to read source code to know how to invoke
that magic?
Sorry if I sound like a jerk. That is not my intention. Hopefully I can
document the answers for future bioperl-db'ers.
Thanks in advance,
j
my current plaything: http://openlab.jays.net
From cjfields at uiuc.edu Wed Jul 19 10:34:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 09:34:48 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push,
and things get complicated
In-Reply-To: <44BD776A.1080402@cornell.edu>
Message-ID: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
The Bio::SearchIO modules are supposed to work like a SAX parser, where results
are returned as the report is parsed b/c of the occurrence of specific
'events' (start_element, end_element, and so on). However, the actual
behaviour for each module changes depending on the report type and the
author's intention.
There was a thread about a month ago on HMMPFAM report parsing where there
was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
output has one HSP per hit and is sorted on the sequence length so a
particular hit can appear more than once, depending on how many times it
hits along the sequence length itself. So, to gather all the HSPs together
under one hit you would have to parse the entire report and build up a
Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
everything. Currently it just reports Hit/HSP pairs and it is up to the
user to build that tree.
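A minimal sketch of building that tree on the user's side, assuming each reported pair has already been reduced to a plain hash with a model name and domain coordinates. The field names and sample values are made up for illustration; they are not SearchIO output.

```perl
use strict;
use warnings;

# Hypothetical flattened Hit/HSP pairs, in the order a report might list them.
my @pairs = (
    { model => 'PF00001', start => 10,  end => 60,  evalue => 1e-20 },
    { model => 'PF00042', start => 70,  end => 120, evalue => 3e-08 },
    { model => 'PF00001', start => 130, end => 180, evalue => 5e-15 },
);

# Group domains (HSPs) under their model (hit), remembering first-seen order.
my (%hsps_for, @hit_order);
for my $pair (@pairs) {
    my $model = $pair->{model};
    push @hit_order, $model unless exists $hsps_for{$model};
    push @{ $hsps_for{$model} }, $pair;
}

# Walk the tree: one hit, then every HSP collected for it.
for my $model (@hit_order) {
    my @hsps = @{ $hsps_for{$model} };
    printf "%s: %d domain(s)\n", $model, scalar @hsps;
    printf "  %d-%d (E=%g)\n", @{$_}{qw(start end evalue)} for @hsps;
}
```

The cost is the one described above: nothing can be handed out until the whole report has been read.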
In contrast, BLAST output should be capable of throwing hit/HSP clusters on
the fly based on the report output, but is quite slow (even the XML output
crawls). Jason thinks it's b/c of object inheritance and instantiation; I
think it's probably more complicated than that (there are a ton of method
calls which tend to slow things down quite a bit as well).
I would say try using SearchIO, but instead of relying directly on object
handler calls to create Hit/HSP objects using an object factory (which is
where I think a majority of the speed is lost), build the data internally on
the fly using start_element/end_element, then return hashes instead based on
the element type triggered using end_element.
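Roughly what I have in mind, as a sketch only. HashingHandler is a made-up class, the element names (Hit, Hit_id, Hsp, Hsp_evalue) are assumed to follow NCBI BLAST XML, and the events are fed by hand here so the example runs without any XML module installed; a real SAX driver would call the same three methods.

```perl
use strict;
use warnings;

package HashingHandler;

# SAX-style handler that builds plain hashes instead of Hit/HSP objects.
sub new {
    my ($class, %args) = @_;
    return bless { on_hit => $args{on_hit}, hit => undef, text => '' }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{text} = '';
    $self->{hit} = { hsps => [] } if $el->{Name} eq 'Hit';
    push @{ $self->{hit}{hsps} }, {} if $el->{Name} eq 'Hsp';
}

sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    my $name = $el->{Name};
    if    ($name eq 'Hit_id')     { $self->{hit}{id} = $self->{text} }
    elsif ($name eq 'Hsp_evalue') { $self->{hit}{hsps}[-1]{evalue} = $self->{text} }
    elsif ($name eq 'Hit') {
        # A hit is complete: hand the finished hash to the caller.
        $self->{on_hit}->($self->{hit});
        $self->{hit} = undef;
    }
    $self->{text} = '';
}

package main;

my @hits;
my $handler = HashingHandler->new(on_hit => sub { push @hits, shift });

# Hand-fed events standing in for a real SAX driver.
my @events = (
    [ start_element => { Name => 'Hit' } ],
    [ start_element => { Name => 'Hit_id' } ],
    [ characters    => { Data => 'gi|123' } ],
    [ end_element   => { Name => 'Hit_id' } ],
    [ start_element => { Name => 'Hsp' } ],
    [ start_element => { Name => 'Hsp_evalue' } ],
    [ characters    => { Data => '1e-30' } ],
    [ end_element   => { Name => 'Hsp_evalue' } ],
    [ end_element   => { Name => 'Hsp' } ],
    [ end_element   => { Name => 'Hit' } ],
);
for my $event (@events) {
    my ($method, $data) = @$event;
    $handler->$method($data);
}

print "$_->{id}: ", scalar @{ $_->{hsps} }, " HSP(s)\n" for @hits;
```

Only one hit is held in memory at a time, and no Hit/HSP objects are created unless the caller chooses to build them from the hash.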
As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
(using XML::SAX::ExpatXS/expat) and plan on switching it over to using
hashes at some point, possibly starting off with a different SearchIO plugin
module. If you have other suggestions (XML parser of choice, ways to speed
up parsing/retrieve data) we would be glad to hear them.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Robert Buels
> Sent: Tuesday, July 18, 2006 7:06 PM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get
> complicated
>
> Hi all,
>
> Here's a kind of abstract question about Bioperl and XML parsing:
>
> I'm thinking about writing a bioperl parser for genomethreader XML, and
> I'm sort of mulling over the 'impedance mismatch' between the way
> bioperl Bio::*IO::* modules work and the way all of the current XML
> parsers work. Bioperl uses a 'pull' model, where every time you want a
> new chunk of stuff, you call $io_object->next_thing. All the XML
> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> 'push' model, where every time they parse a chunk, they call _your_
> code, usually via a subroutine reference you've given to the XML parser
> when you start it up.
>
> From what I can tell, current Bioperl IO modules that parse XML are
> using push parsers to parse the whole document, holding stuff in memory,
> then spoon-feeding it in chunks to the calling program when it calls
> next_*(). This is fine until the input XML gets really big, in which
> case you can quickly run out of memory.
>
> Does anybody have good ideas for nice, robust ways of writing a bioperl
> IO module for really big input XML files? There don't seem to be any
> perl pull parsers for XML. All I've dug up so far would be having the
> XML push parser running in a different thread or process, pushing chunks
> of data into a pipe or similar structure that blocks the progress of the
> push parser until the pulling bioperl code wants the next piece of data,
> but there are plenty of ugly issues with that, whether one were to use
> perl threads for it (aaagh!) or fork and push some kind of intermediate
> format through a pipe or socket between the two processes (eek!).
>
> So, um, if you've read this far, do you have any ideas?
>
> Rob
>
> --
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY 14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
From cjfields at uiuc.edu Wed Jul 19 10:44:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 09:44:30 -0500
Subject: [Bioperl-l] SearchIO HOWTO
In-Reply-To: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com>
Message-ID: <002901c6ab41$d7f61350$15327e82@pyrimidine>
The information in that table is referring to the BLAST report example
before the table itself. However, I can tell you that using that report
works (sorry if the text wrapping here mangles the output), so the table
information is erroneous. I'll do some updating on that.
Chris
Here's the script:
use strict;
use warnings;
use Bio::SearchIO;
use Bio::AlignIO;

# Parse the BLAST report named on the command line.
my $parser = Bio::SearchIO->new(-file   => shift @ARGV,
                                -format => 'blast');
# Write alignments to STDOUT in ClustalW format.
my $aln_out = Bio::AlignIO->new(-fh     => \*STDOUT,
                                -format => 'clustalw');

# Emit one alignment per HSP.
while (my $result = $parser->next_result) {
    while (my $hit = $result->next_hit) {
        while (my $hsp = $hit->next_hsp) {
            $aln_out->write_aln($hsp->get_aln);
        }
    }
}
Output (via STDOUT):
------------------------------------
CLUSTAL W(1.81) multiple sequence alignment

gi|20521485|dbj|AP004641.2/2896-3051  DMGRCSSGCNRYPEPMTPDTMIKLYREKEGLGAYIWMPTPDMSTEGRVQMLP
gb|443893|124775/197-246              DIVQNSSGCNRYPEPMTPDTMIKLYRE-EGL-AYIWMPTPDMSTEGRVQMLP
                                      *: : ********************** *** ********************
------------------------------------
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Bernd Web
> Sent: Wednesday, July 19, 2006 6:36 AM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] SearchIO HOWTO
>
> Hi,
>
> On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO
> parse your BLAST report.
> In the Table of methods, the third line from the bottom is:
> "HSP alignment Not available in this report Bio::SimpleAlign object "
>
> Would it not be good to add the get_aln method ( $hsp->get_aln) ?
>
> The line in "Using the methods"
> my $alignment_as_string = $alnIO->write_aln($aln);
>
> may be confusing: $alignment_as_string will be "1" on success and the
> alignment is printed to STDOUT. Should IO::String be introduced here
> to set up a string filehandle?
>
>
> Best regards,
> Bernd
From cjfields at uiuc.edu Wed Jul 19 10:55:02 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 09:55:02 -0500
Subject: [Bioperl-l] ListSummaries delay apologies
Message-ID: <002a01c6ab43$508aa5a0$15327e82@pyrimidine>
Sorry about the delay for the ListSummaries the past couple months; things
have been pretty hectic here which has put me really behind on them (it
hasn't ever been my top priority, anyway). We're getting papers ready for
publication, I'm going to a summer institute in a few weeks, and research (as
always) is full steam ahead.
Just so everybody knows, I haven't given up on them, and plan on getting
caught up after I get back from the institute in Connecticut (beginning of
August).
Cheers!
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
From hlapp at gmx.net Wed Jul 19 11:31:50 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 19 Jul 2006 11:31:50 -0400
Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db
In-Reply-To:
References:
Message-ID: <62DA6CBC-CD0E-46A7-A669-71FFC808041B@gmx.net>
On Jul 19, 2006, at 9:43 AM, Jay Hannah wrote:
> Howdy --
>
> I'm using bioperl-db + biosql-schema + mySQL.
>
> I can now successfully build a biosql-schema instance in mySQL, load
> taxonomy, then using bioperl-db load a GenBank file from disk,
> committing
> the sequences I want. For a given accession number + version +
> namespace,
> I can tell bioperl-db to delete that from mySQL and it does. Yay!!
> I'll be
> throwing a "Using bioperl-db" document onto the wiki over the next
> week.
Excellent!
>
> What I am currently baffled by:
>
> How do I ask bioperl-db to walk over multiple bioentries in my
> database so
> I can do things with them? The simplest possible example: print a
> list of
> all bioentries in my database.
>
> It is trivially easy to just query mySQL directly, but if I'm
> reading /
> understanding the documentation correctly bioperl-db intends to be
> database schema and RDBMS agnostic. In that case, I should use
> bioperl-db
> to walk my records. So, how do I do that?
Bioperl-db indeed intends to be schema(-variant) and RDBMS agnostic,
but that doesn't mean that you have to be as well. If you find it
trivially easy to query your database using SQL and DBI and you don't
care about being RDBMS or schema-variant agnostic, then by all means
don't feel obligated to go through the bioperl-db API for querying.
Note you can obtain the DBI database handle being used by a
persistence adaptor by calling dbh():
my $dbh = $adaptor->dbh();
(The advantage of this is that you use the same connection, and
therefore the same machinery for obtaining connection parameters and
building the DSN that the rest of bioperl-db uses. Also, you have the
ability to see transactions in progress that have not been committed
yet by the adaptor.)
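For the simplest case asked about (printing every bioentry), a sketch over that handle might look like the following. It assumes the stock BioSQL table and column names (bioentry.accession, bioentry.version, biodatabase.name). list_bioentries is a made-up helper, not part of bioperl-db, and going this route ties the code to the schema.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "namespace:accession.version" strings
# for every bioentry, given any DBI-compatible database handle.
sub list_bioentries {
    my ($dbh) = @_;
    my $sth = $dbh->prepare(q{
        SELECT db.name, e.accession, e.version
          FROM bioentry e
          JOIN biodatabase db ON db.biodatabase_id = e.biodatabase_id
         ORDER BY db.name, e.accession, e.version
    });
    $sth->execute();
    my @entries;
    while (my ($namespace, $accession, $version) = $sth->fetchrow_array) {
        $version = 0 unless defined $version;   # treat a NULL version as 0
        push @entries, "$namespace:$accession.$version";
    }
    return @entries;
}

# Usage, with $adaptor being a bioperl-db persistence adaptor:
#   print "$_\n" for list_bioentries($adaptor->dbh());
```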
What you should not do through SQL directly is modify (UPDATE &
DELETE) entities which bioperl-db also holds in a cache (by default,
terms and dbxrefs), unless you also take care to clear the cache of the
respective adaptor.
>
> Is Bio::DB::Query::BioQuery the way to do this? The only way?
Well, yes, unless you want to use SQL directly (which is not a
despised option, see above).
>
> If so then can someone help me understand the datacollections() and
> where() methods?
datacollections() in essence corresponds to the FROM clause in a SQL
statement, including JOIN statements. '=>' joins two entities in a 1:n
relationship, '<=>' joins two entities in an n:n relationship. Instead
of the table(s) you give the (Bioperl) objects that are to be joined,
and bioperl-db will translate the objects to database entities, i.e.,
tables. Each object may be followed by an alias. The alias makes it
easier to refer to the object (entity) in the query constraint part
(where()). A single alias following a join expression will always
apply to the master object (table).
>
> perldoc Bio::DB::Query::BioQuery
>
> # all mouse sequences loaded under namespace ensembl that
> # have receptor in their description
> $query->datacollections(["Bio::PrimarySeqI e",
> "Bio::Species=>Bio::PrimarySeqI sp",
> "BioNamespace=>Bio::PrimarySeqI
> db"]);
This is short for
$query->datacollections([ # enumerate the objects we need:
"Bio::PrimarySeqI e",
"Bio::Species sp",
"BioNamespace db",
# specify master-detail relationships
"Bio::Species=>Bio::PrimarySeqI",
"BioNamespace=>Bio::PrimarySeqI"]);
because the alias following the join statement applies to the master
entity.
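To make the alias rule concrete, here is a toy expansion of the shorthand into the long form. expand_collections is a made-up helper written only for illustration; it is not how bioperl-db parses these strings, and it handles only the '=>' form shown above.

```perl
use strict;
use warnings;

# Toy illustration only: expand "Master=>Detail alias" into
# "Master alias" plus the bare join "Master=>Detail".
sub expand_collections {
    my @objects;   # plain "Class alias" entries
    my @joins;     # "Master=>Detail" entries
    for my $spec (@_) {
        my ($expr, $alias) = split ' ', $spec;
        if ($expr =~ /^(.+?)=>(.+)$/) {
            my ($master, $detail) = ($1, $2);
            # An alias after a join belongs to the master (left-hand) side.
            push @objects, "$master $alias" if defined $alias;
            push @joins, "$master=>$detail";
        }
        else {
            push @objects, defined $alias ? "$expr $alias" : $expr;
        }
    }
    return (@objects, @joins);
}

my @long = expand_collections(
    "Bio::PrimarySeqI e",
    "Bio::Species=>Bio::PrimarySeqI sp",
    "BioNamespace=>Bio::PrimarySeqI db",
);
print "$_\n" for @long;
```

Note that 'sp' ends up attached to Bio::Species and 'db' to BioNamespace, which is why the where() constraints can say sp.binomial and db.namespace.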
> $query->where(["sp.binomial like 'Mus *'",
> "e.desc like '*receptor*'",
> "db.namespace = 'ensembl'"]);
The where() method corresponds to the WHERE clause in SQL. The
default logical operator between constraints is AND. There is more
documentation on the syntax of expressing constraints in
Bio::DB::Query::QueryConstraint.
The column for which to constrain the value is given as the attribute
(method) of the (bioperl) object. If there are multiple objects in
the 'datacollections' then you need to qualify each attribute by
prefixing it with the object, or the alias assigned in
datacollections(), followed by a dot, corresponding to typical OO syntax.
>
> # all mouse sequences loaded under namespace ensembl that
> # have receptor in their description, and that also have a
> # cross-reference with SWISS as the database
> $query->datacollections(["Bio::PrimarySeqI e",
> "Bio::Species=>Bio::PrimarySeqI sp",
> "BioNamespace=>Bio::PrimarySeqI db",
> "Bio::Annotation::DBLink xref",
>
> I'm bewildered by this API. Please forgive my ignorance.
I understand. This part of the API is by far the one with the
skimpiest documentation.
There are a considerable number of tests in t/query.t which may serve
as examples. They also are known to work if their tests don't fail.
The tests don't actually execute any query, instead some internal
guts are used to test the translation to SQL, so if you know SQL you
may be able to understand better what's going on by seeing the
object-level query and the SQL-level query side-by-side.
>
> 1) How do I get *all* bioentries out of my database?
Your datacollections would consist of the single object Bio::SeqI (or
Bio::PrimarySeqI if you didn't want any annotation), and there would
be no query constraint:
my $query = Bio::DB::Query::BioQuery->new(-datacollections=>
["Bio::SeqI"]);
>
> 2) Say I did want just the "namespace" 'Pico' (one of my
> biodatabase.name's). Where did
>
> "BioNamespace=>Bio::PrimarySeqI db"]);
>
> come from? How was I supposed to figure out the left hand side of that
> mapping? The right hand side? If that line wasn't sitting in that
> document
> was there a way for me to figure it out as a *user* of bioperl-db?
You would not know from Bioperl itself. The right hand side is a
Bioperl class. The left hand side is a kludge because Bioperl does
not have a namespace class, instead objects that have a namespace
implement the Bio::IdentifiableI interface directly. This kind of one
class mapping to two database entities (biodatabase is a table
separate from, in fact a master for, bioentry) is extremely
cumbersome to express in a generic way, so I chose to create a
Bio::DB::Persistent::BioNamespace class to represent that for the
purpose of queries.
> Or would I need to be a *programmer* of bioperl-db reading source
> to figure
> this out? Where did
>
> "db.namespace = 'ensembl'"]);
>
> come from? Again, do I have to read source code to know how to invoke
> that magic?
Well, I'm not sure even reading the source code clears it all up ;)
As I said before, the part before the dot is the alias or object, the
part after is the attribute (or method) to be constrained.
>
> Sorry if I sound like a jerk. That is not my intention. Hopefully I
> can
> document the answers for future bioperl-db'ers.
No problem, that's fine - and whatever you would be willing to
contribute to documentation would be highly appreciated.
-hilmar
>
> Thanks in advance,
>
> j
> my current plaything: http://openlab.jays.net
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From aaron.j.mackey at gsk.com Wed Jul 19 09:48:55 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Wed, 19 Jul 2006 09:48:55 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push,
and things get complicated
In-Reply-To: <44BD776A.1080402@cornell.edu>
Message-ID:
There are 3rd generation XML "Pull" parsers (also called "StAX" for
Streaming API for XML), but they seem to still be stuck in Java land (e.g.
"MXP1")
You could probably use POE to set up a state machine that used XML::Twig to
"push" units of XML content onto a stack, to be read by your "next_*" pull
method (where the XML::Twig push "stalled" until the "next_*" method was
called, and vice versa).
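A different route that needs neither POE nor threads is to feed a push parser in small chunks from inside the "next_*" method, stopping as soon as one record has been queued. The sketch below uses a toy push parser so that it runs with core Perl only; ToyPushParser, PullReader and the <rec> element are all invented for the example. XML::Parser's non-blocking interface (parse_start/parse_more) looks like the natural real-world counterpart, but that pairing is an assumption and untested here.

```perl
use strict;
use warnings;

# Toy push parser: calls $on_record for each complete <rec>...</rec> it sees.
package ToyPushParser;
sub new {
    my ($class, $on_record) = @_;
    return bless { buf => '', on_record => $on_record }, $class;
}
sub parse_more {
    my ($self, $chunk) = @_;
    $self->{buf} .= $chunk;
    # Strip everything up to and including each complete record.
    while ($self->{buf} =~ s{^.*?<rec>(.*?)</rec>}{}s) {
        $self->{on_record}->($1);
    }
}

# Pull wrapper: feeds the push parser only until one record is available.
package PullReader;
sub new {
    my ($class, $fh, $chunk_size) = @_;
    my $queue = [];
    return bless {
        fh     => $fh,
        size   => $chunk_size || 4096,
        queue  => $queue,
        parser => ToyPushParser->new(sub { push @$queue, shift }),
    }, $class;
}
sub next_thing {
    my ($self) = @_;
    while (!@{ $self->{queue} }) {
        my $n = read($self->{fh}, my $chunk, $self->{size});
        return unless $n;                 # end of input, nothing queued
        $self->{parser}->parse_more($chunk);
    }
    return shift @{ $self->{queue} };
}

package main;

my $xml = '<doc><rec>alpha</rec><rec>beta</rec><rec>gamma</rec></doc>';
open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";

my $reader = PullReader->new($fh, 8);     # tiny chunks to exercise the buffering
my @seen;
while (defined(my $rec = $reader->next_thing)) {
    push @seen, $rec;
    print "got $rec\n";
}
```

Memory use is bounded by the chunk size plus whatever records a single chunk completes, rather than by the size of the whole document.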
-Aaron
From arareko at campus.iztacala.unam.mx Wed Jul 19 12:20:21 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Wed, 19 Jul 2006 11:20:21 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push,
and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BE5BC5.5040006@campus.iztacala.unam.mx>
There are a lot of different XML processing strategies. Most fall into
two categories: stream-based and tree-based.
With the stream-based strategy, the parser continuously alerts a program
to patterns in the XML. The parser functions like a pipeline, taking XML
markup on one end and pumping out processed nuggets of data to your program.
With the tree-based strategy, the parser keeps the data to itself until
the very end, when it presents a complete model of the document to your
program. The whole point to this strategy is that your program can pull
out any data it needs, in any order.
Most of the time I use tree-based strategies because they place all of
the data into a structure which lets me access any internal node
using array/hash references. The simplest parser for this is XML::Simple
using XML::Parser as the 'preferred parser' (which is built on top of
XML::Parser::Expat, which is a wrapper around the expat library).
More advanced parsers (both stream and tree-based) are:
* XML::LibXML (a wrapper for libxml2's C library)
* XML::Grove (takes a tree and changes it into an object hierarchy. Each
node type is represented by a different class)
* XML::PYX (for repackaging XML as a stream of easily recognizable and
transmutable symbols)
* XML::SimpleObject (changes a hierarchy of lists into a hierarchy of
objects)
* XML::XPath (for writing expressions that pinpoint specific pieces of
documents)
There are also some standards-based solutions like:
* XML::SAX (Simple API for XML) for event streams.
* XML::DOM (Document Object Model) for tree processing.
Your strategy of choice depends a lot on the type of XML files you want
to parse. Understanding the structure of the files and deciding which is
the data you want to extract from them is a fundamental step to choose
the appropriate method/parser to use.
Just my 2 cents :)
Regards,
Mauricio.
--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM
From cjfields at uiuc.edu Wed Jul 19 14:45:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 13:45:55 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push,
and things get complicated
In-Reply-To: <44BE5BC5.5040006@campus.iztacala.unam.mx>
Message-ID: <000301c6ab63$91d31680$15327e82@pyrimidine>
Yeah, we use XML::SAX, with XML::SAX::ExpatXS and expat, for
SearchIO::blastxml. It previously used XML::Parser::PerlSAX but that didn't
support SAX2-based parsing. XML::Twig is also used quite a bit.
Jason added his thoughts about this to the wiki:
http://www.bioperl.org/wiki/XML_parsers
Personally, I use XML::Simple with EUtilities because the XML returned is
remarkably simple and normally fairly short. The trick is making sure when
parsing data to dereference everything properly since XML::Simple stores
everything in an elaborate data structure. I plan on switching to
XML::SAX::ExpatXS or XML::Twig soon.
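The dereferencing pitfall in a nutshell, as a sketch: a repeated element comes back as an array reference, while a lone one can come back as a plain scalar. The two hashes below are hand-written stand-ins for what XMLin() might return for a search result; the layout (IdList/Id) is an assumption, not captured output, and ids_from is a made-up helper.

```perl
use strict;
use warnings;

# Hand-written stand-ins for parsed results with two IDs and with one ID.
my $many = { Count => 2, IdList => { Id => [ '111', '222' ] } };
my $one  = { Count => 1, IdList => { Id => '333' } };

# Return the IDs as a list whether the parser produced a scalar or an array ref.
sub ids_from {
    my ($result) = @_;
    my $id = $result->{IdList}{Id};
    return ()   unless defined $id;
    return @$id if ref $id eq 'ARRAY';
    return ($id);
}

print join(',', ids_from($many)), "\n";
print join(',', ids_from($one)),  "\n";
```

XML::Simple's ForceArray option is the usual way to avoid the scalar case altogether.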
Chris
> There are a lot of different XML processing strategies. Most fall into
> two categories: stream-based and tree-based.
>
> With the stream-based strategy, the parser continuously alerts a program
> to patterns in the XML. The parser functions like a pipeline, taking XML
> markup on one end and pumping out processed nuggets of data to your
> program.
>
> With the tree-based strategy, the parser keeps the data to itself until
> the very end, when it presents a complete model of the document to your
> program. The whole point to this strategy is that your program can pull
> out any data it needs, in any order.
>
> Most of the times I use tree-based strategies because they place all of
> the data into a structure which lets me to access any internal node
> using array/hash references. The simplest parser for this is XML::Simple
> using XML::Parser as the 'preferred parser' (which is built on top of
> XML::Parser::Expat, which is a wrapper around the expat library).
>
> More advanced parsers (both stream and tree-based) are:
>
> * XML::LibXML (a wrapper for libxml2's C library)
> * XML::Grove (takes a tree and changes it into an object hierarchy. Each
> node type is represented by a different class)
> * XML::PYX (for repackaging XML as a stream of easily recognizable and
> transmutable symbols)
> * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of
> objects)
> * XML::XPath (for writing expressions that pinpoint specific pieces of
> documents)
>
> There are also some standards-based solutions like:
>
> * XML::SAX (Simple API for XML) for event streams.
> * XML::DOM (Document Object Model) for tree processing.
>
> Your strategy of choice depends a lot on the type of XML files you want
> to parse. Understanding the structure of the files and deciding which is
> the data you want to extract from them is a fundamental step to choose
> the appropriate method/parser to use.
>
> Just my 2 cents :)
>
> Regards,
> Mauricio.
>
> Chris Fields wrote:
> > The Bio::SearchIO modules are supposed work like a SAX parser, where
> results
> > are returned as the report is parsed b/c of the occurrence of specific
> > 'events' (start_element, end_element, and so on). However, the actual
> > behaviour for each module changes depending on the report type and the
> > author's intention.
> >
> > There was a thread about a month ago on HMMPFAM report parsing where
> there
> > was some contention as to how to build hits(models)/HSPs(domains).
> HMMPFAM
> > output has one HSP per hit and is sorted on the sequence length so a
> > particular hit can appear more than once, depending on how many times it
> > hits along the sequence length itself. So, to gather all the HSPs
> together
> > under one hit you would have to parse the entire report and build up a
> > Hit/HSP tree, then use the next_hit/next_hsp oterators to parse through
> > everything. Currently it just reports Hit/HSP pairs and it is up to the
> > user to build that tree.
> >
> > In contrast, BLAST output should be capable of throwing hit/HSP clusters
> on
> > the fly based on the report output, but is quite slow (event the XML
> output
> > crawls). Jason thinks it's b/c of object inheritance and instantiation;
> I
> > think it's probably more complicated than that (there are a ton of
> method
> > calls which tend to slow things down quite a bit as well).
> >
> > I would say try using SearchIO, but instead of relying directly on
> object
> > handler calls to create Hit/HSP objects using an object factory (which
> is
> > where I think a majority of the speed is lost), build the data
> internally on
> > the fly using start_element/end_element, then return hashes instead
> based on
> > the element type triggered using end_element.
> >
> > As an aside, I'm trying to switch the SearchIO::blastxml over to
> XML::SAX
> > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> > hashes at some point, possibly starting off with a different SearchIO
> plugin
> > module. If you have other suggestions (XML parser of choice, ways to
> speed
> > up parsing/retrieve data) we would be glad to hear them.
> >
> > Chris
> >
> >
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> bounces at lists.open-bio.org] On Behalf Of Robert Buels
> >> Sent: Tuesday, July 18, 2006 7:06 PM
> >> To: bioperl-l at bioperl.org
> >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get
> >> complicated
> >>
> >> Hi all,
> >>
> >> Here's a kind of abstract question about Bioperl and XML parsing:
> >>
> >> I'm thinking about writing a bioperl parser for genomethreader XML, and
> >> I'm sort of mulling over the 'impedence mismatch' between the way
> >> bioperl Bio::*IO::* modules work and the way all of the current XML
> >> parsers work. Bioperl uses a 'pull' model, where every time you want a
> >> new chunk of stuff, you call $io_object->next_thing. All the XML
> >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> >> 'push' model, where every time they parse a chunk, they call _your_
> >> code, usually via a subroutine reference you've given to the XML parser
> >> when you start it up.
> >>
> >> From what I can tell, current Bioperl IO modules that parse XML are
> >> using push parsers to parse the whole document, holding stuff in
> memory,
> >> then spoon-feeding it in chunks to the calling program when it calls
> >> next_*(). This is fine until the input XML gets really big, in which
> >> case you can quickly run out of memory.
> >>
> >> Does anybody have good ideas for nice, robust ways of writing a bioperl
> >> IO module for really big input XML files? There don't seem to be any
> >> perl pull parsers for XML. All I've dug up so far would be having the
> >> XML push parser running in a different thread or process, pushing
> chunks
> >> of data into a pipe or similar structure that blocks the progress of
> the
> >> push parser until the pulling bioperl code wants the next piece of
> data,
> >> but there are plenty of ugly issues with that, whether one were too use
> >> perl threads for it (aaagh!) or fork and push some kind of intermediate
> >> format through a pipe or socket between the two processes (eek!).
> >>
> >> So, um, if you've read this far, do you have any ideas?
> >>
> >> Rob
> >>
> >> --
> >> Robert Buels
> >> SGN Bioinformatics Analyst
> >> 252A Emerson Hall, Cornell University
> >> Ithaca, NY 14853
> >> Tel: 503-889-8539
> >> rmb32 at cornell.edu
> >> http://www.sgn.cornell.edu
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM
>
From rmb32 at cornell.edu Wed Jul 19 15:30:28 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 12:30:28 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To:
References:
Message-ID: <44BE8854.8010301@cornell.edu>
POE is a really neat thing, I didn't know about it before. Something
tells me, however, that I would have trouble convincing people to
install POE as a dependency for a genomethreader output parser. ;-) I
hope I'll have the opportunity to use it sometime.
For the curious, here's a nice intro to POE:
http://perl.com/pub/a/2001/01/poe.html
And the POE main site:
http://poe.perl.org/
Rob
aaron.j.mackey at GSK.COM wrote:
> There are 3rd generation XML "Pull" parsers (also called "StAX" for
> Streaming API for XML), but they seem to still be stuck in Java land (e.g.
> "MXP1")
>
> You could probably use POE to setup a state machine that used XML::Twig to
> "push" units of XML content onto a stack, to be read by your "next_*" pull
> method (where the XML::Twig push "stalled" until the "next_*" method was
> called, and vice versa).
>
> -Aaron
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu
From dwaner at scitegic.com Wed Jul 19 15:47:58 2006
From: dwaner at scitegic.com (dwaner at scitegic.com)
Date: Wed, 19 Jul 2006 12:47:58 -0700
Subject: [Bioperl-l] EMBL release 87 format changes.
Message-ID:
BioPerl Users and Developers,
I have updated the EMBL SeqIO parser to work correctly with Release 87 of
EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier
message, the EMBL parser now reads both new and old formats, but only
writes the new format.
I don't think that my changes will affect most users, but if you are using
the EMBL format can you review the changes described below and speak up if
anything looks like it could create a problem for you?
If I don't hear any objections soon, I will submit a patch to bugzilla.
Thanks,
- David
Parser changes:
- EMBL files no longer contain the "entry name". When reading old format
  files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id
  and Bio::Seq::display_id, but when reading new format files, the
  accession number is used for these fields.
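The two-format reading rule above can be sketched in plain Perl. This is only an illustration, not the code in the patch: parse_id_line is a hypothetical helper, the regular expressions have not been checked against every historical variant of the ID line, and the two sample lines are illustrative.

```perl
use strict;
use warnings;

# Parse an EMBL ID line in either layout and return a hash reference,
# or nothing if the line matches neither layout.
sub parse_id_line {
    my ($line) = @_;

    # New layout (seven fields), for example:
    # ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
    if ($line =~ /^ID\s+(\S+);\s+SV\s+(\d+);\s+(\w+);\s+([^;]+);\s+(\w+);\s+(\w+);\s+(\d+)\s+BP\./) {
        return {
            format    => 'new',
            id        => $1,    # no entry name any more: the accession is the id
            accession => $1,
            version   => $2,
            topology  => $3,
            molecule  => $4,
            division  => $6,
            length    => $7,
        };
    }

    # Old layout, entry name first, for example:
    # ID   TRBG361    standard; RNA; PLN; 1859 BP.
    if ($line =~ /^ID\s+(\S+)\s+\S+;\s+(circular\s+)?([^;]+);\s+(\w+);\s+(\d+)\s+BP\./) {
        return {
            format   => 'old',
            id       => $1,     # the entry name
            topology => $2 ? 'circular' : 'linear',
            molecule => $3,
            division => $4,
            length   => $5,
        };
    }
    return;
}

for my $line ('ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.',
              'ID   TRBG361    standard; RNA; PLN; 1859 BP.') {
    my $rec = parse_id_line($line);
    print "$rec->{format} format: id=$rec->{id}, length=$rec->{length}\n";
}
```

Trying the new layout first keeps the old-format pattern from ever seeing a new-format line.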
Changes to output:
- The ID line was changed to the new format.
- The SV line is never written; SV is now part of the ID line.
- "DNA" and "RNA" are no longer valid EMBL molecule types. They are now
  written as "unassigned DNA" and "unassigned RNA".
- Strictly speaking, EMBL format should only be used for nucleotide
  sequences. If the alphabet is 'protein', write_seq() emits a warning and
  writes the non-standard molecule type "AA" in the ID line.
- Because BioPerl sequences do not have a "data class" attribute, all
  sequences are written with a data class of "STD" in the ID line.
- The ID line contains the Bio::Seq::accession, unless it is missing, in
  which case the Bio::Seq::id is used.
- molecule type is strictly validated. Non-EMBL values are output as
  "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.
- "taxonomic division" is strictly validated. Non-EMBL values are output
  as "UNC".
- The taxonomic division code "UNK" is now written as "UNC" (unclassified).
Possible Gotchas for some users:
- Because the EMBL entry name is no longer included anywhere in the file,
  when round-tripping from old format to new format the entry name will be
  lost.
- In order to ensure that BioPerl writes valid EMBL files, I have added
  strict validation to the writer for "molecule type" and "taxonomic
  division". This could present a problem for users who are using
  non-standard values for these fields, but I felt it was important to
  write files that adhere to the EMBL spec.
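For readers who want to see how the writer rules listed above fit together, here is a rough sketch in plain Perl. It is not the patch itself: embl_moltype, embl_division and embl_id_line are hypothetical helpers, and the two lookup tables are small illustrative subsets rather than the full EMBL vocabularies.

```perl
use strict;
use warnings;

# Illustrative subsets only; the complete controlled vocabularies are
# defined by EMBL and are not reproduced here.
my %VALID_MOLTYPE  = map { $_ => 1 }
    ('genomic DNA', 'genomic RNA', 'mRNA', 'unassigned DNA', 'unassigned RNA');
my %VALID_DIVISION = map { $_ => 1 } qw(HUM INV MAM PLN PRO VRT UNC);

# Molecule type: keep valid values, otherwise fall back on the alphabet.
sub embl_moltype {
    my ($moltype, $alphabet) = @_;
    if ($alphabet eq 'protein') {
        warn "EMBL format is meant for nucleotide sequences\n";
        return 'AA';    # non-standard, as described in the message
    }
    return $moltype if defined $moltype && $VALID_MOLTYPE{$moltype};
    return $alphabet eq 'rna' ? 'unassigned RNA' : 'unassigned DNA';
}

# Taxonomic division: anything not recognised (including "UNK") becomes "UNC".
sub embl_division {
    my ($division) = @_;
    return 'UNC' unless defined $division && $VALID_DIVISION{$division};
    return $division;
}

# New-style ID line: accession if present, else id; data class is always STD.
sub embl_id_line {
    my (%seq) = @_;
    my $has_acc = defined $seq{accession} && length $seq{accession};
    my $name    = $has_acc ? $seq{accession} : $seq{id};
    return sprintf 'ID   %s; SV %s; %s; %s; STD; %s; %d BP.',
        $name, $seq{version}, $seq{topology},
        embl_moltype($seq{moltype}, $seq{alphabet}),
        embl_division($seq{division}), $seq{length};
}

print embl_id_line(
    accession => 'X56734', version  => 1,     topology => 'linear',
    moltype   => 'RNA',    alphabet => 'rna',
    division  => 'UNK',    length   => 1859,
), "\n";
```

With the old-style values "RNA" and "UNK" as input, the line printed is "ID   X56734; SV 1; linear; unassigned RNA; STD; UNC; 1859 BP.", which shows both fallbacks at once.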
From slenk at emich.edu Wed Jul 19 16:04:16 2006
From: slenk at emich.edu (Stephen Gordon Lenk)
Date: Wed, 19 Jul 2006 16:04:16 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Message-ID: <13edac5b13ed8208.13ed820813edac5b@emich.edu>
Hi,
I have found that POE fails to execute a periodic task after 32
iterations in a Perl thread; the failure is consistent on both XP and
OS X. If I knew how to write up a defect for Perl I would do so (hint:
how is this done? I'm *not* asking to be told to RTFM), and I am
probably remiss for not having done it. I was going to write messages
to a Controller Area Network (CAN) to control automotive widgets from
Perl, but I wound up using a C executable (piped to from Perl) with its
own threads instead. I believe that bio lab systems can be done this
way as well.
But ... POE is really neat if you think in state machine terms. I have
an alternate architecture for my test harness (Perlizer) that would
use POE to run tests with CAN and GPIB.
Steve Lenk
From cjfields at uiuc.edu Wed Jul 19 17:46:43 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 16:46:43 -0500
Subject: [Bioperl-l] EMBL release 87 format changes.
In-Reply-To:
Message-ID: <000601c6ab7c$d39d8cd0$15327e82@pyrimidine>
You can go ahead and submit the patch to Bugzilla anyway. Comments about
the proposed changes from the developers can be added there.
I think there's some confusion here, though: the EMBL SeqIO change you
mentioned I committed is actually for Bio::SeqIO::swiss (SwissProt). I
haven't touched Bio::SeqIO::embl (yet). 'swiss' format now reads old and
new swiss data files and writes only new format; no major changes have been
made to SeqIO::embl in about a year (and even that was a small one).
Chris
From stewarta at nmrc.navy.mil Wed Jul 19 18:00:26 2006
From: stewarta at nmrc.navy.mil (Andrew Stewart)
Date: Wed, 19 Jul 2006 18:00:26 -0400
Subject: [Bioperl-l] #bioperl
Message-ID:
Wandering about the new bioperl.org page, I noticed that there's
never really been much mention of starting up a bioperl chat channel
on IRC for casual bioperl discussion and support. This has worked
really well for projects like MediaWiki, etc. I'll sit on the
channel for a while and maybe we can see if the idea picks up.
Point your favorite IRC client to the following (for Windows users I
would suggest mIRC; for Mac, Colloquy):
server: irc.freenode.net
channel: #bioperl
Hope to see you there.
--
Andrew Stewart
Research Assistant, Genomics Team
Navy Medical Research Center (NMRC)
Biological Defense Research Directorate (BDRD)
BDRD Annex
12300 Washington Avenue, 2nd Floor
Rockville, MD 20852
email: stewarta at nmrc.navy.mil
phone: 301-231-6700 Ext 270
From rmb32 at cornell.edu Wed Jul 19 18:40:52 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 15:40:52 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BEB4F4.1060407@cornell.edu>
Hi Chris,
It seems to me the SearchIO framework isn't really appropriate for
genomethreader, since it's more of a gene prediction program than a
search/alignment program.
Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is
fundamentally different from the other bioperl IO systems: it still has
a next_this(), next_that() interface, which means lots of buffering
memory if you're doing your actual parsing with a push parser (or a tree
parser, of course, which is buffering an expanded form of the entire
document). It looks like it just adds another layer of method calls for
parser events, allowing the SearchIO to make different kinds of objects
and stuff.
It looks like none of this changes the fact that these are all push
parsers, and bioperl pulls, so you have to buffer a lot of stuff. I
guess the only really general strategies for reducing the buffering are
a.) to break up the XML with regexps and such like Hilmar said, or b.) to
put your push parser in another process, and somehow keep it blocking in
one of its callbacks until you're ready for its next data.
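Strategy b.) could look roughly like the following on a Unix-like system, using only core Perl. This is a sketch under several assumptions: make_puller is a hypothetical helper, the producer passed to it stands in for a real push parser, and records are assumed to fit on one line each. Note that the child only stalls once the operating system's pipe buffer is full, so it can run somewhat ahead of the reader rather than stopping after every record.

```perl
use strict;
use warnings;
use POSIX ();

# Run a push-style producer in a child process and return a closure that
# pulls one record per call (undef once the producer has finished).
sub make_puller {
    my ($producer) = @_;    # code ref; it is handed an "emit" callback

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {        # child: the push side
        close $reader;
        $producer->(sub { print {$writer} $_[0], "\n" });
        close $writer;      # flushes whatever is still buffered
        POSIX::_exit(0);    # leave without running the parent's cleanup code
    }

    close $writer;          # parent: the pull side
    my $done = 0;
    return sub {
        return if $done;
        my $line = <$reader>;
        if (!defined $line) {    # end of stream: reap the child
            $done = 1;
            close $reader;
            waitpid $pid, 0;
            return;
        }
        chomp $line;
        return $line;
    };
}

# Stand-in for a push parser: it calls $emit once per "parsed" record.
my $next_record = make_puller(sub {
    my ($emit) = @_;
    $emit->("record $_") for 1 .. 3;
});

while (defined(my $rec = $next_record->())) {
    print "pulled: $rec\n";
}
```

A real parser would have to serialize each record into some intermediate format before writing it to the pipe, which is exactly the overhead complained about at the start of this thread.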
I think what I'll do with the gthxml parser is find a way to split the
input XML into chunks and run a parser separately on each, like Hilmar
said. If more performance is needed, maybe a multi-process approach
would be appropriate, but not yet.
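A minimal sketch of the chunk-splitting idea, again in core Perl only. The element name "prediction" and the sample XML are made up for illustration and are not taken from real genomethreader output; make_chunk_reader is a hypothetical helper. It assumes the chunk element never nests inside itself and is never written in self-closing form, and it rescans its buffer on every line read, so it is not tuned for speed.

```perl
use strict;
use warnings;

# Return a closure that yields one complete <$tag>...</$tag> element per
# call, reading from the handle only as far as it needs to.
sub make_chunk_reader {
    my ($fh, $tag) = @_;
    my $buffer = '';
    return sub {
        while (1) {
            # Cut the first complete element (and anything before it)
            # out of the buffer and hand the element back.
            if ($buffer =~ s{\A.*?(<\Q$tag\E[\s>].*?</\Q$tag\E>)}{}s) {
                return $1;
            }
            my $more = <$fh>;
            return unless defined $more;    # end of input
            $buffer .= $more;
        }
    };
}

my $xml = <<'XML';
<results>
  <prediction id="1"><score>0.9</score></prediction>
  <prediction id="2">
    <score>0.7</score>
  </prediction>
</results>
XML

open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";
my $next_chunk = make_chunk_reader($fh, 'prediction');

while (defined(my $chunk = $next_chunk->())) {
    my ($id) = $chunk =~ /id="(\d+)"/;
    print "prediction $id: ", length($chunk), " characters of XML\n";
}
```

Each chunk is a small, self-contained piece of XML that can be handed to whichever parser is preferred, so memory use is bounded by the largest single chunk instead of the whole file.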
Anyway, looking at blastxml, I have some ruminations, which fill the
rest of this email:
Looking at SearchIO::blastxml, it looks like it's already using
XML::SAX, which will use XML::SAX::ExpatXS if installed. Is that
recent? Is blastxml faster when using the tempfile option than when
putting the whole report in a string in memory? If you're looking for
speed gains, have you tried running some kind of profiling on it?
Whenever one is out to optimize code, profiling should be stop number
one. Almost every time, you will be surprised at what parts of the code
are actually eating up the most time. Here's a perl profiling intro:
http://perl.com/pub/a/2004/06/25/profiling.html . The profiling
mechanism talked about in that article is kind of old; there are also a
bunch of newer code profiling tools available on CPAN. I haven't used
any of them though. But yeah, I can't emphasize enough the importance
of profiling if you're trying to optimize for speed.
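Once a profiler has pointed at a suspect, a micro-benchmark is a cheap way to confirm it. Note that this is benchmarking, not profiling. The sketch below uses the core Benchmark module to compare building a small object through accessor calls with building a plain hash, which is the kind of cost Chris suspects in SearchIO. Toy::HSP is a made-up stand-in, not a BioPerl class, so the numbers say nothing about BioPerl itself.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

{
    package Toy::HSP;    # hypothetical stand-in, not a BioPerl class

    # Constructor that sets every field through its accessor method.
    sub new {
        my ($class, %args) = @_;
        my $self = bless {}, $class;
        $self->$_($args{$_}) for keys %args;
        return $self;
    }
    sub score  { my $self = shift; $self->{score}  = shift if @_; return $self->{score} }
    sub evalue { my $self = shift; $self->{evalue} = shift if @_; return $self->{evalue} }
}

sub as_object { return Toy::HSP->new(score => 42, evalue => 1e-5) }
sub as_hash   { return { score => 42, evalue => 1e-5 } }

# A negative count means "run each candidate for at least that many CPU
# seconds"; cmpthese then prints a table of rates and relative speeds.
cmpthese(-1, {
    object => \&as_object,
    hash   => \&as_hash,
});
```

The absolute rates depend on the machine; only the relative difference between the two rows is of interest.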
As for memory, the blastxml parser suffers from the same handicap I was
pondering at the start of this thread. To see what I mean, think of
what would happen if there were somehow 10 million HSPs in one of the
reports? It's buffering all of them before returning each result, and
your machine could melt. :-) Things would be beautiful (and fast,
probably) if next_hsp() would actually parse the next HSP in the report
instead of just returning a HSP object that's sitting in memory. But
there's not really anything that can be done about that, I don't think.
One nice thing, the blastxml parser's memory footprint doesn't really
suffer if you have 100,000 blast reports in your input file, because it
splits out the reports and parses each one individually. This I think
is a good illustration of what Hilmar was talking about, breaking the
input XML into chunks cuts down on the amount of buffering you have to do.
As XML parsers go, I kind of like XML::Twig, because it manages to
combine most of the easy use of a DOM/tree parser with the better memory
usage and speed of a push parser (like SAX and XML::Parser). Within a
parser callback, you have a DOM-like tree that's just the part of your
XML document you're interested in at that time, and then you free that
structure when you're done picking things out of it. I'm not sure how
fast it is, though, probably not as fast as ExpatXS. At any rate, it is
definitely a lot more intuitive to use than a more standard push parser,
since if you make good choices about what elements to use as the roots
of your twigs, you can often do your processing on a self-contained
chunk and not have to keep track of a bunch of parse state like you
typically need with a straight push parser like XML::Parser or a SAX parser.
Rob
Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on). However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself. So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything. Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls). Jason thinks it's b/c of object inheritance and instantiation; I
> think it's probably more complicated than that (there are a ton of method
> calls which tend to slow things down quite a bit as well).
>
> I would say try using SearchIO, but instead of relying directly on object
> handler calls to create Hit/HSP objects using an object factory (which is
> where I think a majority of the speed is lost), build the data internally on
> the fly using start_element/end_element, then return hashes instead based on
> the element type triggered using end_element.
>
> As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
> (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> hashes at some point, possibly starting off with a different SearchIO plugin
> module. If you have other suggestions (XML parser of choice, ways to speed
> up parsing/retrieve data) we would be glad to hear them.
>
> Chris
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu
From skirov at utk.edu Wed Jul 19 17:54:03 2006
From: skirov at utk.edu (Stefan Kirov)
Date: Wed, 19 Jul 2006 17:54:03 -0400
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
Message-ID: <44BEA9FB.1070009@utk.edu>
I have nothing to do with TFBS (except for using it). I suggest you
contact Boris Lenhard who is behind TFBS.
Please also send bioperl questions to the list.
Finally, I believe TRANSFAC does not distribute the data files anymore.
However, if you find out this is not the case, please let me know.
Stefan
ong at embl.de wrote:
>Hi,
>
> Good day, I am trying to retrieve TRANSFAC matrices via the TFBS Perl module, but
>it happens that about 50 matrices are missing after M00359. Do you have any idea why?
>Also, I wish to try using the Bio::Matrix::PSM::IO object, but can you advise how
>I can get the matrix.dat, which is a TRANSFAC file?
>
> Thanks, and I hope to hear from you soon.
>
>Regards,
>Ong
>
>
From bix at sendu.me.uk Thu Jul 20 02:49:45 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 07:49:45 +0100
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <44BEA9FB.1070009@utk.edu>
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
<44BEA9FB.1070009@utk.edu>
Message-ID: <44BF2789.1090204@sendu.me.uk>
Stefan Kirov wrote:
> Finally, I believe TRANSFAC does not distribute the data files anymore.
> However, if you find out this is not the case, please let me know.
They get distributed as Transfac 'Pro', for which you need a license
(money).
> ong at embl.de wrote:
>> good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but
>> it happens that about 50 matrices are missing after M00359 do you have any idea?
What is meant by this? Missing from where? At the least, M00360 is
accessible via the website (public database).
>> Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how
>> do i get the matrix.dat which is a transfac file?
http://www.biobase-international.com/pages/index.php?id=174
From dhoworth at mrc-lmb.cam.ac.uk Thu Jul 20 05:19:22 2006
From: dhoworth at mrc-lmb.cam.ac.uk (Dave Howorth)
Date: Thu, 20 Jul 2006 10:19:22 +0100
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <13edac5b13ed8208.13ed820813edac5b@emich.edu>
References: <13edac5b13ed8208.13ed820813edac5b@emich.edu>
Message-ID: <44BF4A9A.60100@mrc-lmb.cam.ac.uk>
Stephen Gordon Lenk wrote:
> I have found that POE fails to execute a periodic task after 32
> iterations in a Perl thread, consistent failure on both XP and OSX -
> if I knew how to write up a defect for Perl I would do this (hint ?
> how is this done - I'm *not* asking RTFM etc)
Generally:
Go to http://search.cpan.org and search for the module (POE).
Click on the distribution link, rather than the doc link (i.e.
POE-0.3502, which takes you to http://search.cpan.org/~rcaputo/POE-0.3502/).
Click on the View/Report Bugs link.
Check through the existing bugs and if it's not there click on the
Report a new bug link.
Cheers, Dave
From georg.otto at tuebingen.mpg.de Thu Jul 20 06:53:53 2006
From: georg.otto at tuebingen.mpg.de (Georg Otto)
Date: Thu, 20 Jul 2006 12:53:53 +0200
Subject: [Bioperl-l] Features in SeqIO GenBank output
Message-ID:
Hi,
this is probably a FAQ but I could not find anything to solve it.
I want to get sequences from GenBank and save them in GenBank
format. This works with the script shown below, but the "Features"
part is missing and contains references instead (see below). How can I
print out the complete GenBank entry?
I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7
Best,
Georg
Here is my script:
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GenBank;
my $acc = 'AB017118';
my $db_obj = Bio::DB::GenBank->new();
my $seq_obj = $db_obj->get_Seq_by_acc($acc);
my $out = Bio::SeqIO->new(-format => 'genbank',
                          -file   => '>output.gb');
$out->write_seq($seq_obj);
Here is the output:
LOCUS AB017118 2038 bp mRNA linear VRT 06-JUN-2006
DEFINITION Danio rerio mRNA for ornithine decarboxylase antizyme long
isoform, complete cds.
ACCESSION AB017118
VERSION AB017118.1 GI:4239978
KEYWORDS .
SOURCE Danio rerio (zebrafish)
ORGANISM Danio rerio
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
Cypriniformes; Cyprinidae; Danio.
REFERENCE 1
AUTHORS Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., Okamoto,H.,
Hayashi,S., Murakami,Y. and Matsufuji,S.
TITLE Two zebrafish (Danio rerio) antizymes with different expression
and activities
JOURNAL Biochem. J. 345 PT 1, 99-106 (2000)
PUBMED 10600644
REFERENCE 2 (bases 1 to 2038)
AUTHORS Matsufuji,S. and Saito,T.
TITLE Direct Submission
JOURNAL Submitted (23-AUG-1998) Senya Matsufuji, Jikei University School
of Medicine, Department of Biochemistry II; 3-25-8 Nishishinbashi,
Minato-ku, Tokyo 105-8461, Japan (E-mail:senya at jikei.ac.jp,
Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897)
FEATURES Location/Qualifiers
source 1..2038
/db_xref="Bio::Annotation::SimpleValue=HASH(0x19b9a28)"
/mol_type="Bio::Annotation::SimpleValue=HASH(0x19b9b6c)"
/dev_stage="Bio::Annotation::SimpleValue=HASH(0x19b9bb4)"
/organism="Bio::Annotation::SimpleValue=HASH(0x19bfe18)"
/clone_lib="Bio::Annotation::SimpleValue=HASH(0x19bfe60)"
CDS join(45..224,226..702)
/db_xref="Bio::Annotation::SimpleValue=HASH(0x19c0960)"
/ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1
9beecc)"
/codon_start=Bio::Annotation::SimpleValue=HASH(0x19bef14)
/protein_id="Bio::Annotation::SimpleValue=HASH(0x19bef5c)"
/translation="Bio::Annotation::SimpleValue=HASH(0x19befa4)
"
/product="Bio::Annotation::SimpleValue=HASH(0x19befec)"
/note="Bio::Annotation::SimpleValue=HASH(0x19bf034)"
CDS 45..227
/db_xref="Bio::Annotation::SimpleValue=HASH(0x19bee24)"
/codon_start=Bio::Annotation::SimpleValue=HASH(0x19bf160)
/protein_id="Bio::Annotation::SimpleValue=HASH(0x19bf1cc)"
/translation="Bio::Annotation::SimpleValue=HASH(0x19c1830)
"
/note="Bio::Annotation::SimpleValue=HASH(0x19c1878)"
polyA_signal 2017..2022
polyA_site 2038
/note="Bio::Annotation::SimpleValue=HASH(0x19bffc8)"
BASE COUNT 439 a 377 c 532 g 690 t
ORIGIN
1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta aaatccaacc
1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat ttaaagac
//
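A note on the symptom: strings such as "Bio::Annotation::SimpleValue=HASH(0x19b9a28)" are what Perl prints when an object reference is interpolated into a string, which suggests the writer received the annotation object where it expected the annotation's text. The sketch below reproduces the effect with two made-up classes (Toy::Value and Toy::OverloadedValue, not BioPerl code) and an illustrative qualifier value. It shows the general mechanism only and makes no claim about where the problem sits in BioPerl.

```perl
use strict;
use warnings;

{
    package Toy::Value;    # hypothetical stand-in for an annotation object
    sub new   { my ($class, $value) = @_; return bless { value => $value }, $class }
    sub value { return $_[0]{value} }
}

{
    package Toy::OverloadedValue;    # same data, but with a string form
    our @ISA = ('Toy::Value');
    use overload '""' => sub { $_[0]->value }, fallback => 1;
}

my $plain = Toy::Value->new('taxon:7955');

# Interpolating the object itself gives ClassName=HASH(0x...).
print qq{/db_xref="$plain"\n};

# Asking the object for its value gives the text that was wanted.
print qq{/db_xref="}, $plain->value, qq{"\n};

# With string overloading, plain interpolation works as well.
my $smart = Toy::OverloadedValue->new('taxon:7955');
print qq{/db_xref="$smart"\n};
```

Either route gets rid of the HASH(0x...) text: the code doing the printing can call the value method explicitly, or the class can define its own string form.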
From cjfields at uiuc.edu Thu Jul 20 08:43:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 07:43:08 -0500
Subject: [Bioperl-l] Features in SeqIO GenBank output
In-Reply-To:
References:
Message-ID: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu>
I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see
if this was fixed.
Chris
On Jul 20, 2006, at 5:53 AM, Georg Otto wrote:
>
> Hi,
>
> this is probably a FAQ but I could not find anything to solve it.
>
> I want to get sequences from GenBank and save them in GenBank
> format. This works with the script shown below, but the "Features"
> part is missing and contains references instead (see below). How can I
> print out the complete GenBank entry?
>
> I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7
>
> Best,
>
> Georg
>
>
>
> Here is my script:
>
> use strict;
> use warnings;
>
> use Bio::Seq;
> use Bio::SeqIO;
> use Bio::DB::GenBank;
>
>
> my $acc = 'AB017118';
> my $db_obj = Bio::DB::GenBank->new();
> my $seq_obj = $db_obj-> get_Seq_by_acc($acc);
> my $out = Bio::SeqIO->new(-format => 'genbank',
> -file => '>output.gb');
> $out->write_seq($seq_obj);
>
>
>
> Here is the output:
>
> LOCUS AB017118 2038 bp mRNA linear VRT
> 06-JUN-2006
> DEFINITION Danio rerio mRNA for ornithine decarboxylase antizyme long
> isoform, complete cds.
> ACCESSION AB017118
> VERSION AB017118.1 GI:4239978
> KEYWORDS .
> SOURCE Danio rerio (zebrafish)
> ORGANISM Danio rerio
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
> Euteleostomi;
> Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
> Cypriniformes; Cyprinidae; Danio.
> REFERENCE 1
> AUTHORS Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y.,
> Okamoto,H.,
> Hayashi,S., Murakami,Y. and Matsufuji,S.
> TITLE Two zebrafish (Danio rerio) antizymes with different
> expression
> and activities
> JOURNAL Biochem. J. 345 PT 1, 99-106 (2000)
> PUBMED 10600644
> REFERENCE 2 (bases 1 to 2038)
> AUTHORS Matsufuji,S. and Saito,T.
> TITLE Direct Submission
> JOURNAL Submitted (23-AUG-1998) Senya Matsufuji, Jikei
> University School
> of Medicine, Department of Biochemistry II; 3-25-8
> Nishishinbashi,
> Minato-ku, Tokyo 105-8461, Japan (E-
> mail:senya at jikei.ac.jp,
> Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897)
> FEATURES Location/Qualifiers
> source 1..2038
> /db_xref="Bio::Annotation::SimpleValue=HASH
> (0x19b9a28)"
> /mol_type="Bio::Annotation::SimpleValue=HASH
> (0x19b9b6c)"
> /dev_stage="Bio::Annotation::SimpleValue=HASH
> (0x19b9bb4)"
> /organism="Bio::Annotation::SimpleValue=HASH
> (0x19bfe18)"
> /clone_lib="Bio::Annotation::SimpleValue=HASH
> (0x19bfe60)"
> CDS join(45..224,226..702)
> /db_xref="Bio::Annotation::SimpleValue=HASH
> (0x19c0960)"
> /
> ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1
> 9beecc)"
> /codon_start=Bio::Annotation::SimpleValue=HASH
> (0x19bef14)
> /protein_id="Bio::Annotation::SimpleValue=HASH
> (0x19bef5c)"
> /translation="Bio::Annotation::SimpleValue=HASH
> (0x19befa4)
> "
> /product="Bio::Annotation::SimpleValue=HASH
> (0x19befec)"
> /note="Bio::Annotation::SimpleValue=HASH
> (0x19bf034)"
> CDS 45..227
> /db_xref="Bio::Annotation::SimpleValue=HASH
> (0x19bee24)"
> /codon_start=Bio::Annotation::SimpleValue=HASH
> (0x19bf160)
> /protein_id="Bio::Annotation::SimpleValue=HASH
> (0x19bf1cc)"
> /translation="Bio::Annotation::SimpleValue=HASH
> (0x19c1830)
> "
> /note="Bio::Annotation::SimpleValue=HASH
> (0x19c1878)"
> polyA_signal 2017..2022
> polyA_site 2038
> /note="Bio::Annotation::SimpleValue=HASH
> (0x19bffc8)"
> BASE COUNT 439 a 377 c 532 g 690 t
> ORIGIN
> 1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta
> aaatccaacc
>
>
>
>
> 1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat
> ttaaagac
> //
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Thu Jul 20 09:35:43 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 14:35:43 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBBB69.6000906@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
Message-ID: <44BF86AF.8080408@sendu.me.uk>
Sendu Bala wrote:
> node 2 has name 'Bacteria ' and rank 'superkingdom'
> node 1386 has name 'Bacillus ' and rank 'genus'
> node 7776 has name 'Gnathostomata ' and rank 'superclass'
> etc.
>
> For me the bits in <> are inappropriate and shouldn't be there.
> [...]
> If there are no objections I'll strip the <> bits. I also plan to make
> $node->name('scientific', 'sapiens'); set and get the node name, and
> have flatfile and entrez store all common names with
> $obj->name('common', 'human', 'man');.
I'll describe all the changes I've now made and if no-one complains I'll
commit. (I've also made these notes into bug 2047 for easier reference
in the future.)
Bio::DB::Taxonomy::flatfile
---------------------------
# Bug-fixes
Removed invalid requirement that all species nodes have at least 7
named-rank parents.
The names->id solution used by get_taxonid() only stored the last id
associated with a name. However the name used wasn't necessarily unique,
such that multiple ids could match. names->id solution now remembers all
ids that match a name.
API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
and it returns an array of ids in list context. For backward
compatibility it returns one of the ids in scalar context, and
*get_taxonid = \&get_taxonids.
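The calling convention described above can be sketched in plain Perl. This
is only an illustration, not the actual flatfile code: the package
My::NameIndex, its add_name() method and the second id (999999) are
invented here; only the get_taxonids()/get_taxonid() names and the id 1386
for 'Bacillus' come from this thread.

```perl
use strict;
use warnings;

package My::NameIndex;   # hypothetical stand-in, not Bio::DB::Taxonomy::flatfile

sub new {
    my ($class) = @_;
    return bless { ids_for => {} }, $class;
}

# Remember every id seen for a name instead of overwriting the previous one.
sub add_name {
    my ($self, $name, $id) = @_;
    push @{ $self->{ids_for}{$name} }, $id;
    return;
}

# List context: all matching ids. Scalar context: one of them.
sub get_taxonids {
    my ($self, $name) = @_;
    my @ids = @{ $self->{ids_for}{$name} || [] };
    return wantarray ? @ids : $ids[0];
}

# Backward-compatible alias for the old method name.
{
    no warnings 'once';
    *get_taxonid = \&get_taxonids;
}

package main;

my $index = My::NameIndex->new;
$index->add_name('Bacillus', 1386);
$index->add_name('Bacillus', 999999);   # made-up second id, for illustration only

my @all = $index->get_taxonids('Bacillus');   # list context
my $one = $index->get_taxonid('Bacillus');    # scalar context, old name
print "all: @all; one: $one\n";
```

Old code that calls get_taxonid() in scalar context keeps working, while
new code can ask for the full list.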
Added missing division ENV 'Environmental samples'.
# Improvements
Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the
common names, genetic code and mitochondrial genetic code in each node
it makes.
NOTE: entrez also stores creation, publication and update dates, but
this data is not available in the taxdump from NCBI ftp site.
NOTE: the common names are stored in no particular order; the genbank
common name in particular isn't necessarily the first in the list (cf.
old entrez.pm behaviour).
BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
division as a three letter code, like 'PRI'. However, for consistency
with entrez and the scientific_name() of the node the division is
supposed to correspond to, it is now stored as the full name, like
'Primates'.
The names->id solution also stores the artificially uniqued names like
'Craniata ', allowing you for the first time to retrieve the
correct id. Previously the search would have simply failed completely.
The names->id solution now handles nodes with scientific names of 'xyz
(class)', allowing you to retrieve the id with both get_taxonids('xyz')
and get_taxonids('xyz (class)'). Previously only the latter would work.
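One way to get both lookups to succeed is to index a ' (class)' name
twice, once as given and once with the suffix removed. The sketch below
assumes a simple hash of arrays; index_name() and the id 42 are made up
for illustration and are not taken from the module.

```perl
use strict;
use warnings;

my %ids_for;   # name => [ids]

# Index a name, and also its bare form when it ends in ' (class)'.
sub index_name {
    my ($name, $id) = @_;
    push @{ $ids_for{$name} }, $id;
    (my $bare = $name) =~ s/ \(class\)$//;
    push @{ $ids_for{$bare} }, $id if $bare ne $name;
    return;
}

index_name('xyz (class)', 42);   # id 42 is hypothetical

for my $query ('xyz', 'xyz (class)') {
    my @ids = @{ $ids_for{$query} || [] };
    print "$query => @ids\n";
}
```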
NOTE: the previous 2 changes (and the issues with entrez, see below)
make flatfile better at searching the taxonomy database than entrez
module or the website, both in terms of speed and completeness of results.
BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
always being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion mangled very badly certain other species names.
Bio::DB::Taxonomy::entrez
-------------------------
# Bug-fixes
Special characters like ", ( and ) in the input query string to
get_taxonid() result in the failure or inaccuracy of the search. These
characters are now removed prior to submission, allowing for correct
search results.
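As a rough sketch of that clean-up step (clean_query() is a name invented
here, and the real module may do this differently), the three characters
can be deleted with tr///d before the query is submitted:

```perl
use strict;
use warnings;

# Remove the characters ", ( and ) from a query string.
sub clean_query {
    my ($query) = @_;
    (my $clean = $query) =~ tr/"()//d;
    return $clean;
}

print clean_query('"Homo sapiens" (human)'), "\n";   # prints: Homo sapiens human
```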
API-CHANGE: entrez has always been able to return multiple ids that
match a single input name, so I've renamed get_taxonid() to
get_taxonids() and it returns an array of ids in list context. It
returns one of the ids in scalar context. For backward compatibility,
*get_taxonid = \&get_taxonids.
NOTE: entrez modules (and website) cannot cope with '' in the
query, failing searches like 'Craniata '. For this reason, if
get_taxonids() is given a query with '' it will immediately
return undefined, saving a pointless website access. If you want the id
of 'Craniata ' you must search for 'Craniata', then get the
node for each returned id to see which one has a parent node with a
scientific_name() or common_names() case-insensitive matching to 'chordata'.
# Improvements
BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.
BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/
\(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion mangled very badly certain other species names.
BEHAVIOUR-CHANGE: all common names of a node are now stored in the
resulting Node object with Bio::Taxonomy::Node->new(-common_names =>
\@names). This means that the Genbank common name is now just one
amongst others, and isn't guaranteed to be the first in the list either.
Bio::Taxonomy::Node
-------------------
# Bug-fixes
Minor fixes to get get_Children_Nodes(), get_Lineage_Nodes()
and get_LCA_Node() to work correctly.
classification() has a proper solution to finding the classification
when the array wasn't manually set.
# Improvements
BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
it is an alias to name('scientific').
NOTE: node_name is what is set when ->new(-name => $name) is set, so
flatfile and entrez and user-created nodes now implicitly associate the
name of the node they create with its scientific name.
BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial().
Now it is *scientific_name = \&node_name.
binomial(), in addition to working the old way (assume first two
elements of classification array are species and genus, combine them),
will shortcut and return the scientific_name() if we are a node with
rank 'species' and scientific_name is two words. This makes binomial()
an effective synonym of scientific_name() when Nodes were constructed as
per flatfile or entrez, and when it is used correctly on a species node.
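The binomial() logic described above might look roughly like this. It is
a sketch that takes a plain hash instead of a Bio::Taxonomy::Node object,
and the key names (rank, scientific_name, classification) are assumptions
made for the example.

```perl
use strict;
use warnings;

sub binomial {
    my (%node) = @_;   # hypothetical flat hash standing in for a Node object
    my $sci = $node{scientific_name};

    # Shortcut: a species node whose scientific name is exactly two words.
    if (defined $node{rank} && $node{rank} eq 'species' && defined $sci) {
        my @words = split ' ', $sci;
        return $sci if @words == 2;
    }

    # Old way: assume the first two classification elements are species
    # and genus, and combine them.
    my ($species, $genus) = @{ $node{classification} || [] };
    return unless defined $species && defined $genus;
    return "$genus $species";
}

print binomial(
    rank            => 'species',
    scientific_name => 'Homo sapiens',
    classification  => ['Homo sapiens', 'Homo'],
), "\n";
```

The fallback branch only gives a sensible result for old-style
classification arrays such as ('sapiens', 'Homo').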
BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could
assign and retrieve different values to/from each method.)
New method common_names() supersedes common_name(), returning a list of
all common_names. For backward compatibility, returns one of the names
in scalar context, and *common_name = \&common_names.
-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
of a factory once set, and a factory seems redundant when we're a node
with a -dbh.
species() and genus() issue a warning when you try to use them on a node
that isn't of rank 'species' (since they interact with the
classification array and not names('method') like the other similar
methods).
validate_name() removed because it just returns 1.
validate_species_name() removed because species() can (should) now
contain the real species name, like 'Homo sapiens', not 'sapiens'. But
it could also be any wonderfully complex thing, so there's nothing we
can confidently check for as being 'correct'.
t/Taxonomy.t
------------
Runs a slightly more comprehensive set of tests on entrez, which are now
only skipped if data retrieval fails.
Tests flatfile on a cut-down version of the taxdump.
> I'll also fix the problem with node names for ranks species and lower,
> as discussed in thread 'Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names', in the way I suggested there.
This hasn't been done per se, because we now store the real
ScientificName so there is no 'mishandling' to fix.
From bix at sendu.me.uk Thu Jul 20 09:49:04 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 14:49:04 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF86AF.8080408@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
Message-ID: <44BF89D0.7090103@sendu.me.uk>
Sendu Bala wrote:
>
> Bio::DB::Taxonomy::flatfile
>
> BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
> always being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched) or the $node->classification() array. Previously, a species
> node would have its name converted from 'Homo sapiens' to 'sapiens', but
> the conversion mangled very badly certain other species names.
[...]
> Bio::DB::Taxonomy::entrez
>
> BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/
> \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched) or the $node->classification() array. Previously, a species
> node would have its name converted from 'Homo sapiens' to 'sapiens', but
> the conversion mangled very badly certain other species names.
Oops. In both cases the scientific name has ' (class)' removed from it,
but the original name (with ' (class)') is stored as one of the common
names.
From georg.otto at tuebingen.mpg.de Thu Jul 20 10:29:33 2006
From: georg.otto at tuebingen.mpg.de (Georg Otto)
Date: Thu, 20 Jul 2006 16:29:33 +0200
Subject: [Bioperl-l] Features in SeqIO GenBank output
References:
<73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu>
Message-ID:
This indeed seems to be the case. After upgrading it works fine. Sorry
for stealing your time.
Georg
Chris Fields writes:
> I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see
> if this was fixed.
>
> Chris
From prabubio at gmail.com Thu Jul 20 12:01:35 2006
From: prabubio at gmail.com (Prabu R)
Date: Thu, 20 Jul 2006 21:31:35 +0530
Subject: [Bioperl-l] Blast Output Parsing
Message-ID:
Dear All!
I am now trying to parse a Blast output using PERL.
I have to extract each alignment and have to parse the alignment. I mean, I
have to check whether a particular part of the given sequence got aligned
100%.
Anybody please tell me what module in PERL I have to use for getting this.
I've tried Bio::SearchIO. But I didn't get any method to get the alignment.
Kindly help.
Thanks,
R. Prabu
From cjfields at uiuc.edu Thu Jul 20 13:03:17 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 12:03:17 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF86AF.8080408@sendu.me.uk>
Message-ID: <002901c6ac1e$66ea3820$15327e82@pyrimidine>
These all seem fine to me, and I've added a few comments below.
Fantastic work!
I still plan on switching Bio::DB::Taxonomy::entrez to use
Bio::DB::EUtilities at some point but probably won't get around to it until
August; I still need to write up tests for the EUtilities modules. I may
add a method for retrieving tax data based on protein/nucleotide sequence
primary ID and relevant sequence database, so you could directly retrieve
the relevant TaxID w/o parsing sequences directly for them. This would
mainly be useful if you gather GIs from a BLAST search, for instance.
Anyway, I could add this in then base class Bio::DB::Taxonomy directly so
one could used the retrieved TaxIDs for flat-file or entrez searches; this
requires, of course, access to the remote Entrez database (it would use
ELink). Would that be of interest?
If so, I'll work on that and add relevant tests to Taxonomy.t when I can.
> Bio::DB::Taxonomy::flatfile
> ---------------------------
...
> API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
> and it returns an array of ids in list context. For backward
> compatibility it returns one of the ids in scalar context, and
> *get_taxonid = \&get_taxonids.
Returning a scalar makes sense as long as it's noted in the POD. I have seen
similar methods return an array ref based on wantarray instead of a scalar,
but that largely depends on the complexity of the array (an array of hashes,
for instance).
...
> Bio::DB::Taxonomy::entrez
> -------------------------
...
> NOTE: entrez modules (and website) cannot cope with '' in the
> query, failing searches like 'Craniata '. For this reason, if
> get_taxonids() is given a query with '' it will immediately
> return undefined, saving a pointless website access. If you want the id
> of 'Craniata ' you must search for 'Craniata', then get the
> node for each returned id to see which one has a parent node with a
> scientific_name() or common_names() case-insensitive matching to
> 'chordata'.
It may be something with the esearch interface, though the direct TaxBrowser
query also seems to have problems with this:
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
I'll try looking into it to see if there is a more direct way to get those
(there probably isn't).
> # Improvements
> BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.
>
> BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/
> \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched) or the $node->classification() array. Previously, a species
> node would have its name converted from 'Homo sapiens' to 'sapiens', but
> the conversion mangled very badly certain other species names.
This actually relates to the similar comment made for
Bio::DB::Taxonomy::flatfile. The mangling probably depends on the current
node and whether using flatfile or XML (entrez). Most of the odd XML
examples I posted before, where the TaxID associated with a sequence had
extra data, were a rank of 'no rank'. The species rank, if present, has a
normal binomial name for :
Flavobacterium johnsoniae UW101
...
Flavobacterium johnsoniae
species
Pseudomonas putida F1
...
Pseudomonas putida
species
Caldicellulosiruptor saccharolyticus DSM 8903
...
Caldicellulosiruptor saccharolyticus
species
The genus rank has one name; the subspecies rank has the full species name
with 'subsp.' followed by the subspecies name. So, if using XML, one could
use the taxon subelements stored in the XML element to sort out
genus(), species(), subspecies(), and also higher order elements if someone
wanted to implement them.
This, of course, isn't necessary for the current changes, but down the road
if anybody wanted it...
...
> Bio::Taxonomy::Node
> -------------------
...
> species() and genus() issue a warning when you try to use them on a node
> that isn't of rank 'species' (since they interact with the
> classification array and not names('method') like the other similar
> methods).
I would just have genus() and species() issue warnings if they aren't set to
a particular value. So, if the current node is at the genus rank, genus()
will be set but species() won't be. And no need to do additional checking!
Fabulous work Sendu!
Chris
From cjfields at uiuc.edu Thu Jul 20 13:23:14 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 12:23:14 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF89D0.7090103@sendu.me.uk>
Message-ID: <002a01c6ac21$2ed16190$15327e82@pyrimidine>
Just thought of something...
You had mentioned using a stripped-down version of Bio::Taxonomy::Node
previously, which led to a bit of contention. One way to make everybody
happy would be to create an interface class that contains the basic shared
methods (Bio::Taxonomy::NodeI), then have the currently-named
Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or
something similar) implement those methods along with the current methods.
Another class (your stripped down version, which could then be
Bio::Taxonomy::Node) would also implement whatever base class methods were
needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could
use either object type where required.
|------Node
NodeI----|
|------Species
Another option would be to have Bio::Taxonomy::Node itself stripped down,
then have another class (Bio::Taxonomy::Species) inherit methods from it and
also implement additional methods (genus(), species(), etc).
Node----Species
Would something like that be feasible? I favor the interface version as it
sticks with the interface-implementation design that Bioperl has been
migrating towards:
http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design
This would also help out with the whole Bio::Species issue; just have
Bio::Taxonomy::Species replace it.
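To make the interface option concrete, here is a minimal sketch. The My::
package names are hypothetical stand-ins for the proposed classes, and
croak() takes the place of whatever "not implemented" mechanism the real
interface would use.

```perl
use strict;
use warnings;

package My::Taxonomy::NodeI;    # hypothetical interface class
use Carp qw(croak);

# Interface methods fail loudly unless an implementing class overrides them.
sub id              { croak ref(shift) . ' must implement id()' }
sub rank            { croak ref(shift) . ' must implement rank()' }
sub scientific_name { croak ref(shift) . ' must implement scientific_name()' }

package My::Taxonomy::Node;     # stripped-down implementation
our @ISA = ('My::Taxonomy::NodeI');

sub new             { my ($class, %args) = @_; return bless {%args}, $class }
sub id              { $_[0]{id} }
sub rank            { $_[0]{rank} }
sub scientific_name { $_[0]{name} }

package My::Taxonomy::Species;  # richer implementation with extra methods
our @ISA = ('My::Taxonomy::NodeI');

sub new             { my ($class, %args) = @_; return bless {%args}, $class }
sub id              { $_[0]{id} }
sub rank            { $_[0]{rank} }
sub scientific_name { $_[0]{name} }
sub genus           { (split ' ', $_[0]{name})[0] }

package main;

# Code written against the interface accepts either implementation.
sub describe {
    my ($node) = @_;
    die "not a NodeI\n" unless $node->isa('My::Taxonomy::NodeI');
    return sprintf '%s (%s)', $node->scientific_name, $node->rank;
}

my $genus_node   = My::Taxonomy::Node->new(id => 1386, name => 'Bacillus', rank => 'genus');
my $species_node = My::Taxonomy::Species->new(id => 9606, name => 'Homo sapiens', rank => 'species');
print describe($_), "\n" for $genus_node, $species_node;
```

Both objects pass the isa() check, so either can be used where a NodeI is
required, while only the Species class carries genus().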
Chris
From cjfields at uiuc.edu Thu Jul 20 13:31:42 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 12:31:42 -0500
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To:
Message-ID: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object.
You can then use Bio::AlignIO to generate the alignment output if needed, or
use the Bio::SimpleAlign methods to get what you want.
http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/HOWTO:SearchIO
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign.html
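A rough sketch of that approach follows. The covers_region() helper, the
report file and the region coordinates are invented for the example; the
Bio::SearchIO, HSP and Bio::SimpleAlign calls are the standard ones, but
check them against the version you have installed.

```perl
use strict;
use warnings;

# Hypothetical helper: does an HSP spanning [$hsp_start, $hsp_end] on the
# query cover the region [$from, $to], with 100% identity over the whole HSP?
# A region inside a less-than-perfect HSP would need a column-by-column check
# of the alignment instead.
sub covers_region {
    my ($hsp_start, $hsp_end, $pct_identity, $from, $to) = @_;
    return ($hsp_start <= $from && $hsp_end >= $to && $pct_identity >= 100) ? 1 : 0;
}

sub report_hsps {
    my ($file, $from, $to) = @_;
    require Bio::SearchIO;   # loaded lazily so the helper above works alone
    my $in = Bio::SearchIO->new(-format => 'blast', -file => $file);
    while (my $result = $in->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                my $aln = $hsp->get_aln;   # a Bio::SimpleAlign object
                my $ok  = covers_region($hsp->start('query'), $hsp->end('query'),
                                        $hsp->percent_identity, $from, $to);
                printf "%s\tquery %d-%d\t%.1f%% identity\tregion fully matched: %s\n",
                    $hit->name, $hsp->start('query'), $hsp->end('query'),
                    $aln->percentage_identity, $ok ? 'yes' : 'no';
            }
        }
    }
    return;
}

# Usage (file name and coordinates are placeholders):
#   perl hsp_report.pl report.blast 50 150
report_hsps(@ARGV) if @ARGV == 3;
```

To print an alignment itself, the $aln object can be passed to
Bio::AlignIO, e.g. Bio::AlignIO->new(-format => 'clustalw', -fh =>
\*STDOUT)->write_aln($aln).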
Chris
From bix at sendu.me.uk Thu Jul 20 13:53:03 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 18:53:03 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <002901c6ac1e$66ea3820$15327e82@pyrimidine>
References: <002901c6ac1e$66ea3820$15327e82@pyrimidine>
Message-ID: <44BFC2FF.3030704@sendu.me.uk>
Chris Fields wrote:
>
> I still plan on switching Bio::DB::Taxonomy::entrez to use
> Bio::DB::EUtilities at some point but probably won't get around to it until
> August;
If I may make two feature requests (apologies if you've already done
them): a) automatically enforce the 3-second wait rule when querying via
the NCBI website; b) automatically cache results locally in a reasonable
way, so that repeated queries for the same result don't have to go via
the website.
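To illustrate what I mean, here is a minimal sketch of both requests
combined in one wrapper. make_polite_fetcher() is a made-up name, the
cache is only an in-memory hash, and the code ref stands in for whatever
really performs the remote query.

```perl
use strict;
use warnings;

# Wrap $fetch (a code ref doing the real remote query) so that repeated
# queries come from a cache and remote calls are at least $min_interval
# seconds apart.
sub make_polite_fetcher {
    my ($fetch, $min_interval) = @_;
    my %cache;
    my $last_call;
    return sub {
        my ($query) = @_;
        return $cache{$query} if exists $cache{$query};
        if (defined $last_call) {
            my $wait = $min_interval - (time() - $last_call);
            sleep $wait if $wait > 0;
        }
        $last_call = time();
        return $cache{$query} = $fetch->($query);
    };
}

my $remote_calls = 0;
my $fetcher = make_polite_fetcher(
    sub { $remote_calls++; return "result for $_[0]" },   # fake remote query
    3,
);

$fetcher->('Craniata');
$fetcher->('Craniata');   # second call is served from the cache
print "remote calls: $remote_calls\n";   # prints: remote calls: 1
```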
> Anyway, I could add this in then base class Bio::DB::Taxonomy directly so
> one could used the retrieved TaxIDs for flat-file or entrez searches; this
> requires, of course, access to the remote Entrez database (it would use
> ELink). Would that be of interest?
Sorry, I don't really understand this paragraph. I'm unable to parse
'...then base class Bio::DB::Taxonomy directly so...', for starters.
>> Bio::Taxonomy::Node
>> -------------------
>
> ...
>
>> species() and genus() issue a warning when you try to use them on a node
>> that isn't of rank 'species' (since they interact with the
>> classification array and not names('method') like the other similar
>> methods).
>
> I would just have genus() and species() issue warnings if they aren't set to
> a particular value. So, if the current node is at the genus rank, genus()
> will be set but species() won't be. And no need to do additional checking!
The problem is, genus() and species() are special cases that aren't
normally directly set. They get their values from the classification
array: genus() returns (classification())[1] and species() returns
(classification())[0]. They set the same values. Doing this is only sane
(though is still likely to be wrong, given that there can be ranks
between species and genus) when the node is of rank 'species', hence the
warnings.
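In outline, the behaviour is something like the following. This is a
simplified sketch using a plain hash ref rather than the real Node
object, and _from_classification() is a helper name invented for the
example.

```perl
use strict;
use warnings;

# species() and genus() read fixed positions of the classification array
# and warn when the node is not of rank 'species'.
sub species { return _from_classification($_[0], 0, 'species') }
sub genus   { return _from_classification($_[0], 1, 'genus') }

sub _from_classification {
    my ($node, $index, $method) = @_;
    my $rank = defined $node->{rank} ? $node->{rank} : 'unknown';
    warn "$method() used on a node of rank '$rank'\n" unless $rank eq 'species';
    return $node->{classification}[$index];
}

my $human = { rank => 'species', classification => ['Homo sapiens', 'Homo', 'Hominidae'] };
print species($human), ' / ', genus($human), "\n";

my $homo = { rank => 'genus', classification => ['Homo', 'Hominidae'] };
my $wrong = genus($homo);   # warns, and returns 'Hominidae', which is not a genus
```

The last call shows why the warning is there: on a non-species node the
array positions no longer mean species and genus.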
I imagine this is to work with pesky file formats like genbank, so I
can't really change anything here without major overhaul. And my plans
for overhaul involve getting rid of genus() and species(), so I'll just
leave them be for now.
Anyway, thanks for your comments and input into this thread! It's much
appreciated.
From bix at sendu.me.uk Thu Jul 20 13:55:56 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 18:55:56 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <002a01c6ac21$2ed16190$15327e82@pyrimidine>
References: <002a01c6ac21$2ed16190$15327e82@pyrimidine>
Message-ID: <44BFC3AC.8010704@sendu.me.uk>
Chris Fields wrote:
> Just thought of something...
>
> You had mentioned using a stripped-down version of Bio::Taxonomy::Node
> previously, which led to a bit of contention. One way to make everybody
> happy would be to create an interface class that contains the basic shared
> methods (Bio::Taxonomy::NodeI), then have the currently-named
> Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or
> something similar) implement those methods along with the current methods.
> Another class (your stripped down version, which could then be
> Bio::Taxonomy::Node) would also implement whatever base class methods were
> needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could
> use either object type where required.
>
> |------Node
> NodeI----|
> |------Species
[...]
> I favor the interface version as it
> sticks with the interface-implementation design that Bioperl has been
> migrating towards:
>
> http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design
>
> This would also help out with the whole Bio::Species issue; just have
> Bio::Taxonomy::Species replace it.
Yes, this sounds good to me. Should I still wait until Jason/elders are
able to comment before I start exploring this avenue?
From cjfields at uiuc.edu Thu Jul 20 14:21:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 13:21:48 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BFC3AC.8010704@sendu.me.uk>
Message-ID: <000601c6ac29$5d533a90$15327e82@pyrimidine>
I would say go ahead, why not? This would likely lead to the eventual
deprecation of Bio::Species, which was in the cards anyway.
The only problem I can foresee is which class to use with
Bio::DB::Taxonomy*? I guess one could settle on one class by default and
have the option to use another Bio::Taxonomy::NodeI-implementing class if
you wanted more data/methods available...
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Thursday, July 20, 2006 12:56 PM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Chris Fields wrote:
> > Just thought of something...
> >
> > You had mentioned using a stripped-down version of Bio::Taxonomy::Node
> > previously, which led to a bit of contention. One way to make everybody
> > happy would be to create an interface class that contains the basic shared
> > methods (Bio::Taxonomy::NodeI), then have the currently-named
> > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or
> > something similar) implement those methods along with the current methods.
> > Another class (your stripped down version, which could then be
> > Bio::Taxonomy::Node) would also implement whatever base class methods were
> > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could
> > use either object type where required.
> >
> >          |------Node
> > NodeI----|
> >          |------Species
> [...]
> > I favor the interface version as it
> > sticks with the interface-implementation design that Bioperl has been
> > migrating towards:
> >
> > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design
> >
> > This would also help out with the whole Bio::Species issue; just have
> > Bio::Taxonomy::Species replace it.
>
> Yes, this sounds good to me. Should I still wait until Jason/elders are
> able to comment before I start exploring this avenue?
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
From hlapp at gmx.net Thu Jul 20 14:24:19 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Jul 2006 14:24:19 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BFC3AC.8010704@sendu.me.uk>
References: <002a01c6ac21$2ed16190$15327e82@pyrimidine>
<44BFC3AC.8010704@sendu.me.uk>
Message-ID:
On Jul 20, 2006, at 1:55 PM, Sendu Bala wrote:
>
> Yes, this sounds good to me. Should I still wait until Jason/elders are
> able to comment before I start exploring this avenue?
Unless you're afraid that your suggestions are going too wild for our
palate please do go ahead. The joy of CVS is we can always go back.
For my part, I just haven't been able to keep up with the flurry of
long emails ... I'll have to do some extensive bedtime reading (and
then writing ;) soon I guess :-)
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From saunders at uchicago.edu Thu Jul 20 17:47:08 2006
From: saunders at uchicago.edu (Matthew A. Saunders)
Date: Thu, 20 Jul 2006 16:47:08 -0500 (CDT)
Subject: [Bioperl-l] installing bioperl
Message-ID:
Dear Bioperl representative,
I have been trying to install bioperl (in order to ultimately run some
Ensembl APIs) but I seem to be having some problems with the
bioperl installation.
I have followed the installation directions and I get to the last steps of
the "make" process, yet this stage fails with the error message below.
Can you possibly tell me what the problem is? I am not sure that I
understand the command "make", but I think it requires a file named
"makefile" in the given folder. When I look in my newly formed
"bioperl-1.4" folder, there is no "makefile" in there. Perhaps that
is the problem. If so, how might I rectify the matter?
Thanks!
Matt
************************************************************* . . .
Enjoy the rest of bioperl, which you can use after going 'make install'
Checking if your kit is complete...
Looks good
/usr/bin/perl: symbol lookup error:
/usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:
undefined symbol: db_version
Running make test
Make had some problems, maybe interrupted? Won't test
Running make install
Make had some problems, maybe interrupted? Won't install
***************************************************************
-----------------------------------------------------
Matthew A. Saunders
UNCF-MERCK Postdoctoral Research Fellow
Dept. of Ecology and Evolution
University of Chicago
(773)834-3964
Skype: mattsaunders555
http://home.uchicago.edu/~saunders
-------------------------------------------------------
From saunders at uchicago.edu Thu Jul 20 18:01:53 2006
From: saunders at uchicago.edu (Matthew A. Saunders)
Date: Thu, 20 Jul 2006 17:01:53 -0500 (CDT)
Subject: [Bioperl-l] installing bioperl
In-Reply-To:
References:
Message-ID:
In continuation to my described problem, I have just installed the
bioperl-run file from the .tar.gz format and that was successful through
the "perl Makefile.PL" and the "make" & "make test" phases.
It is the "bioperl core" file that is still giving me the problems
described below.
Thanks!
Matt
********************************
On Thu, 20 Jul 2006, Matthew A. Saunders wrote:
> Dear Bioperl representative,
>
> I have been trying to install bioperl (in order to ultimately run some
> Ensembl APIs) but I seem to be having some problems with the bioperl
> installation.
>
> I have followed the installation directions and I get to the last steps of
> the "make" process, yet this stage fails with the error message below. Can
> you possibly tell me what is the problem. I am not sure that I understand
> the command "make", but I think that it requires that there be a file named
> "makefile" in the given folder, when I look in my newly formed "bioperl-1.4"
> folder there is no "makefile" in there. Perhaps that is a problem. If so,
> how might I rectify the matter?
>
> Thanks!
>
> Matt
>
>
> ************************************************************* . . .
> Enjoy the rest of bioperl, which you can use after going 'make install'
>
> Checking if your kit is complete...
> Looks good
> /usr/bin/perl: symbol lookup error:
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:
> undefined symbol: db_version
> Running make test
> Make had some problems, maybe interrupted? Won't test
> Running make install
> Make had some problems, maybe interrupted? Won't install
> ***************************************************************
>
>
>
> -----------------------------------------------------
> Matthew A. Saunders
> UNCF-MERCK Postdoctoral Research Fellow
>
> Dept. of Ecology and Evolution
> University of Chicago
> (773)834-3964
> Skype: mattsaunders555
> http://home.uchicago.edu/~saunders
> -------------------------------------------------------
>
>
From bix at sendu.me.uk Thu Jul 20 18:47:33 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 23:47:33 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine>
Message-ID: <44C00805.7090403@sendu.me.uk>
Chris Fields wrote:
> As for caching,
> do you mean caching of the tax information or the sequence ID information?
Anything you get from entrez.
> Caching of tax information would be great, but how would you go about it? I
> can see how it would be easy to have a cache for the flatfile using a local
> index, but not so much for XML data retrieved from Entrez (a
> continually-appended local file, maybe, with an accompanying index file?).
I didn't actually mean a stored file (but that would be possible with a
tied hash or something: DB_File, just like flatfile), but an in-memory
one for use during the course of program execution. Stored file would
probably be dangerous because you wouldn't know if the data has become
stale or not - and checking to see if it wasn't would defeat the point.
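
A minimal sketch of such an in-memory cache, assuming a hash keyed by taxon ID that lives only for the duration of the program. fetch_remote() is a hypothetical stand-in for the Entrez lookup and makes no network calls; the call counter exists only to show the effect.

```perl
use strict;
use warnings;

my %cache;             # lives only for the duration of the program
my $remote_calls = 0;  # counts how often the "remote" lookup is hit

# Stand-in for a slow remote lookup (hypothetical; no network access here).
sub fetch_remote {
    my ($taxid) = @_;
    $remote_calls++;
    my %fake = (9606 => 'Homo sapiens', 10090 => 'Mus musculus');
    return $fake{$taxid};
}

# Return the cached value if present, otherwise fetch and remember it.
sub get_name {
    my ($taxid) = @_;
    $cache{$taxid} = fetch_remote($taxid) unless exists $cache{$taxid};
    return $cache{$taxid};
}

get_name(9606) for 1 .. 3;   # only the first call reaches fetch_remote()
get_name(10090);
print "remote calls: $remote_calls\n";
```

Nothing is written to disk, so the staleness concern raised above does not arise: the cache is rebuilt from scratch on every run.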
>> The problem is, genus() and species() are special cases that aren't
>> normally directly set. They get their values from the classification
>> array: genus() returns (classification())[1] and species() returns
>> (classification())[0]. They set the same values. Doing this is only sane
>> (though is still likely to be wrong, given that there can be ranks
>> between species and genus) when the node is of rank 'species', hence the
>> warnings.
>>
>> I imagine this is to work with pesky file formats like genbank, so I
>> can't really change anything here without major overhaul. And my plans
>> for overhaul involve getting rid of genus() and species(), so I'll just
>> leave them be for now.
>
> This would all depend on where the information came from; if the information
> came from the Entrez XML element data:
>
[snip]
>
> The subspecies(), genus(), and species() could all be set from this instead
> of the classification array. The problem lies then with the flatfile data
> and how it would be parsed out, if that's at all possible with the flatfile
> data. If not, I see why you would rather have this return a stripped-down
> Bio::Taxonomy::Node object.
>
> I would have to look at how everything is indexed in
> Bio::DB::Taxonomy::entrez, but I think it's feasible.
entrez already parses through LineageEx to build the classification
array. flatfile walks up all the parents to do the same. Having the
information isn't the issue. We have the information. The methods
genus() and species() need to work with the GenBank file format; that is
the problem.
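
The positional behaviour quoted earlier in this message (species from element 0 of the classification array, genus from element 1) can be illustrated as follows. The accessor names and both lineages are illustrative assumptions, not code or data from Bio::Species.

```perl
use strict;
use warnings;

# Classification arrays run from the most specific name upwards.
# Both lineages are illustrative values, not data from the thread.
my @human = ('sapiens', 'Homo', 'Hominidae', 'Primates');
my @fly   = ('melanogaster', 'melanogaster subgroup', 'melanogaster group',
             'Sophophora', 'Drosophila');

# Positional accessors mirroring the behaviour described in the quote.
sub species_of { my ($classification) = @_; return $classification->[0] }
sub genus_of   { my ($classification) = @_; return $classification->[1] }

for my $lineage (\@human, \@fly) {
    printf "species=%s genus=%s\n", species_of($lineage), genus_of($lineage);
}
```

For the first lineage the positions happen to line up. For the second, the ranks sitting between species and genus mean that element 1 is not the genus at all, which is the failure mode the quoted text warns about.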
From MEC at stowers-institute.org Thu Jul 20 18:40:55 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Thu, 20 Jul 2006 17:40:55 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
Message-ID:
Rohan,
'snp/human/human_snp' is the database name you need to use to blast into
human snp database at NCBI
See the following documents for the full list (the link was provided to
me via personal correspondence with the NCBI helpdesk). Very useful...
Hmm, looking again, there now appear to be two versions:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last
updated 2/7/2006)
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html
(last updated 5/29/2006)
Neither is linked to by any other document on the internet (google sez),
including anywhere else at NCBI. Go figure. They should be, IMHO, since
this info is collected nowhere else.
Of course it may be out of date, but it always has got me through.
Good luck
Malcolm Cook - mec at stowers-institute.org - 816-926-4449
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, MO USA
>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
>Sent: Monday, July 17, 2006 4:26 PM
>To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome
>
>Okay, I think I may know what's going on a little more now with NCBI's BLAST
>interface. Looks like any NCBI BLAST query must use the default URL and so
>must set up the proper GET/PUT commands to retrieve everything correctly.
>
>Here's the API description for it all:
>
>http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
>
>You could try setting the database to 'snp' or something along those lines
>instead of 'nr'; or you could see what the name of the database is when you
>use the web form and try setting it to that. According to this page, this
>should be possible:
>
>http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP_test._Search_dbSNP_Using_B
>
>The Entrez Query limit was a recommendation for limiting your search to a
>set of sequences for human, for instance.
>
>I'll try looking into it a bit more but I'm pretty busy. If you find
>anything out you should probably post it here.
>
>Chris
>
>> Hi Chris,
>>
>> 1. I have tried changing the database to snp or dbSNP but
>neither works.
>> It
>> seems that depending on which type of blast you use(ie, Genome Blast,
>> Blast SNP,
>> normal blast such as blastn, etc...) you see a different listing of
>> databases
>> available for querys. Since you mention that the Blast page I see was
>> generated
>> by Genome, where could I go to see a complete listing of
>databases I can
>> query??
>> Or if you knew off hand which database to search if I only
>wanted dbSNP
>> hits?
>>
>> 2. You also mention, I can limit the search by using Entrez
>terms. Do you
>> mean
>> like:
>> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc';
>> where 'abc' is the name of the subject with which you would
>only like to
>> see
>> result of. For example if you put it as 'Homo
>sapiens[Organism]' then only
>> human
>> sequences would be in hit lists.
>> If this is what you mean, what would I change it to, to see
>only hits from
>> dbSNP?
>>
>> Thanks for the ongoing help,
>>
>> Rohan
>>
>> Quoting Chris Fields :
>>
>> > I added a method to RemoteBlast in bioperl-live (CVS) if
>you want to
>> play
>> > with changing the URL. I have been thinking about doing
>this for a bit
>> now
>> > but I already see problems.
>> >
>> > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page
>> (note
>> > the differences in the URL) but a user-friendly request
>page, generated
>> on
>> > the fly by Genome, to submit BLAST requests for the
>relevant database.
>> So
>> > changing the URL will not work (even by adding extra
>parameters); you
>> only
>> > get the original HTML web page.
>> >
>> > You could try changing the database or limiting the search using an
>> Entrez
>> > term (which you should be able to include in the request,
>probably by
>> adding
>> > it to the HEADER).
>> >
>> > Chris
>> >
>> > > -----Original Message-----
>> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> > > bounces at lists.open-bio.org] On Behalf Of
>> vrramnar at student.cs.uwaterloo.ca
>> > > Sent: Thursday, July 13, 2006 5:39 PM
>> > > To: bioperl-l at lists.open-bio.org
>> > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome
>> > >
>> > >
>> > > Hello Again,
>> > >
>> > > I have another question regarding Remote blast but this
>time using
>> Genome
>> > > Blast.
>> > >
>> > > Here is the link:
>> > >
>> > >
>>
>http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
>> > >
>> > > which again uses the main Blast web site:
>> > >
>> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
>> > >
>> > > Again I am not sure what to add or what HEADER
>information to change
>> > > within my
>> > > script.
>> > >
>> > > Here is my program, which was the same as the last email:
>> > >
>> > > #!/usr/bin/perl -w
>> > >
>> > > use Bio::Perl;
>> > > use Bio::Tools::Run::RemoteBlast;
>> > >
>> > > my $prog = "blastn";
>> > > my $db = "refseq_genomic";
>> > > my $e_val = 0.01;
>> > >
>> > > my @params = ( '-prog' => $prog,
>> > > '-data' => $db,
>> > > '-expect' => $e_val);
>> > >
>> > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params);
>> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'}
>= '????'; <--
>> ---
>> > > what
>> > > do I put here
>> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} =
>'????'; <--- Do I
>> need
>> > > to add
>> > > any other values to the form inputs
>> > >
>> > > $factory->submit_blast("blast.in");
>> > > $v = 1;
>> > >
>> > > while (my @rids = $factory->each_rid)
>> > > { foreach my $rid ( @rids )
>> > > { my $rc = $factory->retrieve_blast($rid);
>> > > if( !ref($rc) )
>> > > { if( $rc < 0 )
>> > > { $factory->remove_rid($rid);
>> > > }
>> > > print STDERR "." if ( $v > 0 );
>> > > sleep 5;
>> > > }
>> > > else
>> > > { my $result = $rc->next_result();
>> > > my $filename = $result->query_name()."\.out";
>> > > $factory->save_output($filename);
>> > > $factory->remove_rid($rid);
>> > > print "\nQuery Name: ", $result->query_name(), "\n";
>> > > }
>> > > }
>> > > }
>> > >
>> > >
>> > > Both of my questions are very similiar as in I know how
>to use remote
>> > > blast but
>> > > not sure what to change to access the specific blast I want.
>> > >
>> > > Again, any help would be very appreciated!!
>> > >
>> > > Rohan
>> > >
>> > >
>> > >
>> > > ----------------------------------------
>> > > This mail sent through www.mywaterloo.ca
From cjfields at uiuc.edu Thu Jul 20 19:01:02 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:01:02 -0500
Subject: [Bioperl-l] installing bioperl
In-Reply-To:
References:
Message-ID: <68C6025D-A9FE-47F0-905C-28B79C4B843A@uiuc.edu>
Did you run
perl Makefile.PL
make
make install
'perl Makefile.PL' generates the Makefile.
Something screwy with DB_File, apparently, is also going on.
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:
Try updating or reinstalling DB_File.
Chris
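
Before rerunning the build it may help to confirm whether DB_File loads at all. The following is a small sketch of such a check; module_status() is a hypothetical helper, not part of the BioPerl installer. One caveat: a dynamic-linker failure such as the "symbol lookup error" shown in the report can abort the interpreter outright rather than raise a catchable Perl error, in which case the script dying at this point is itself the diagnosis.

```perl
use strict;
use warnings;

# Return a one-line status string describing whether a module loads.
sub module_status {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    my $ok = eval { require $file; 1 };
    return "$module loads OK" if $ok;
    my $err = $@ || 'unknown error';
    $err =~ s/\s+/ /g;                        # collapse to a single line
    return "$module failed to load: $err";
}

print module_status('DB_File'), "\n";
```

If the module fails here, reinstalling DB_File against the Berkeley DB library actually present on the system is the likely next step, as suggested above.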
On Jul 20, 2006, at 4:47 PM, Matthew A. Saunders wrote:
> Dear Bioperl representative,
>
> I have been trying to install bioperl (in order to ultimately run some
> Ensembl APIs) but I seem to be having some problems with the
> bioperl installation.
>
> I have followed the installation directions and I get to the last steps of
> the "make" process, yet this stage fails with the error message below.
> Can you possibly tell me what is the problem. I am not sure that I
> understand the command "make", but I think that it requires that there be
> a file named "makefile" in the given folder, when I look in my newly
> formed "bioperl-1.4" folder there is no "makefile" in there. Perhaps that
> is a problem. If so, how might I rectify the matter?
>
> Thanks!
>
> Matt
>
>
> ************************************************************* . . .
> Enjoy the rest of bioperl, which you can use after going 'make install'
>
> Checking if your kit is complete...
> Looks good
> /usr/bin/perl: symbol lookup error:
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:
> undefined symbol: db_version
> Running make test
> Make had some problems, maybe interrupted? Won't test
> Running make install
> Make had some problems, maybe interrupted? Won't install
> ***************************************************************
>
>
>
> -----------------------------------------------------
> Matthew A. Saunders
> UNCF-MERCK Postdoctoral Research Fellow
>
> Dept. of Ecology and Evolution
> University of Chicago
> (773)834-3964
> Skype: mattsaunders555
> http://home.uchicago.edu/~saunders
> -------------------------------------------------------
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Thu Jul 20 19:02:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:02:08 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID:
Nice to know! I'll add this to the wiki.
Chris
On Jul 20, 2006, at 5:40 PM, Cook, Malcolm wrote:
> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
>
> Hmm, looking again, there appear now to be two versions:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last
> updated 2/7/2006)
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html
> (last updated 5/29/2006)
>
> Neither are linked to by any other document on the internet (google sez)
> including anywhere else at NCBI. Go figure. It should be IMHO since
> this info is nowhere else collected.
>
> Of course it may be out of date, but it always has got me through.
>
> Good luck
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA
>
>
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris
>> Fields
>> Sent: Monday, July 17, 2006 4:26 PM
>> To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome
>>
>> Okay, I think I may know what's going on a little more now with NCBI's BLAST
>> interface. Looks like any NCBI BLAST query must use the default URL and so
>> must set up the proper GET/PUT commands to retrieve everything correctly.
>>
>> Here's the API description for it all:
>>
>> http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
>>
>> You could try setting the database to 'snp' or something along those lines
>> instead of 'nr'; or you could see what the name of the database is when you
>> use the web form and try setting it to that. According to this page, this
>> should be possible:
>>
>> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP_test._Search_dbSNP_Using_B
>>
>> The Entrez Query limit was a recommendation for limiting your search to a
>> set of sequences for human, for instance.
>>
>> I'll try looking into it a bit more but I'm pretty busy. If you find
>> anything out you should probably post it here.
>>
>> Chris
>>
>>> Hi Chris,
>>>
>>> 1. I have tried changing the database to snp or dbSNP but
>> neither works.
>>> It
>>> seems that depending on which type of blast you use(ie, Genome
>>> Blast,
>>> Blast SNP,
>>> normal blast such as blastn, etc...) you see a different listing of
>>> databases
>>> available for querys. Since you mention that the Blast page I see
>>> was
>>> generated
>>> by Genome, where could I go to see a complete listing of
>> databases I can
>>> query??
>>> Or if you knew off hand which database to search if I only
>> wanted dbSNP
>>> hits?
>>>
>>> 2. You also mention, I can limit the search by using Entrez
>> terms. Do you
>>> mean
>>> like:
>>> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc';
>>> where 'abc' is the name of the subject with which you would
>> only like to
>>> see
>>> result of. For example if you put it as 'Homo
>> sapiens[Organism]' then only
>>> human
>>> sequences would be in hit lists.
>>> If this is what you mean, what would I change it to, to see
>> only hits from
>>> dbSNP?
>>>
>>> Thanks for the ongoing help,
>>>
>>> Rohan
>>>
>>> Quoting Chris Fields :
>>>
>>>> I added a method to RemoteBlast in bioperl-live (CVS) if
>> you want to
>>> play
>>>> with changing the URL. I have been thinking about doing
>> this for a bit
>>> now
>>>> but I already see problems.
>>>>
>>>> Here's the issue: the BLAST page you see is NOT the NCBI BLAST page
>>> (note
>>>> the differences in the URL) but a user-friendly request
>> page, generated
>>> on
>>>> the fly by Genome, to submit BLAST requests for the
>> relevant database.
>>> So
>>>> changing the URL will not work (even by adding extra
>> parameters); you
>>> only
>>>> get the original HTML web page.
>>>>
>>>> You could try changing the database or limiting the search using an
>>> Entrez
>>>> term (which you should be able to include in the request,
>> probably by
>>> adding
>>>> it to the HEADER).
>>>>
>>>> Chris
>>>>
>>>>> -----Original Message-----
>>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>>> bounces at lists.open-bio.org] On Behalf Of
>>> vrramnar at student.cs.uwaterloo.ca
>>>>> Sent: Thursday, July 13, 2006 5:39 PM
>>>>> To: bioperl-l at lists.open-bio.org
>>>>> Subject: [Bioperl-l] Remote Blast - Blast Human Genome
>>>>>
>>>>>
>>>>> Hello Again,
>>>>>
>>>>> I have another question regarding Remote blast but this
>> time using
>>> Genome
>>>>> Blast.
>>>>>
>>>>> Here is the link:
>>>>>
>>>>>
>>>
>> http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
>>>>>
>>>>> which again uses the main Blast web site:
>>>>>
>>>>> http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
>>>>>
>>>>> Again I am not sure what to add or what HEADER
>> information to change
>>>>> within my
>>>>> script.
>>>>>
>>>>> Here is my program, which was the same as the last email:
>>>>>
>>>>> #!/usr/bin/perl -w
>>>>>
>>>>> use Bio::Perl;
>>>>> use Bio::Tools::Run::RemoteBlast;
>>>>>
>>>>> my $prog = "blastn";
>>>>> my $db = "refseq_genomic";
>>>>> my $e_val = 0.01;
>>>>>
>>>>> my @params = ( '-prog' => $prog,
>>>>> '-data' => $db,
>>>>> '-expect' => $e_val);
>>>>>
>>>>> my $factory = new Bio::Tools::Run::RemoteBlast->new(@params);
>>>>> $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'}
>> = '????'; <--
>>> ---
>>>>> what
>>>>> do I put here
>>>>> #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} =
>> '????'; <--- Do I
>>> need
>>>>> to add
>>>>> any other values to the form inputs
>>>>>
>>>>> $factory->submit_blast("blast.in");
>>>>> $v = 1;
>>>>>
>>>>> while (my @rids = $factory->each_rid)
>>>>> { foreach my $rid ( @rids )
>>>>> { my $rc = $factory->retrieve_blast($rid);
>>>>> if( !ref($rc) )
>>>>> { if( $rc < 0 )
>>>>> { $factory->remove_rid($rid);
>>>>> }
>>>>> print STDERR "." if ( $v > 0 );
>>>>> sleep 5;
>>>>> }
>>>>> else
>>>>> { my $result = $rc->next_result();
>>>>> my $filename = $result->query_name()."\.out";
>>>>> $factory->save_output($filename);
>>>>> $factory->remove_rid($rid);
>>>>> print "\nQuery Name: ", $result->query_name(), "\n";
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> Both of my questions are very similiar as in I know how
>> to use remote
>>>>> blast but
>>>>> not sure what to change to access the specific blast I want.
>>>>>
>>>>> Again, any help would be very appreciated!!
>>>>>
>>>>> Rohan
>>>>>
>>>>>
>>>>>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:07:15 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:07:15 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID: <1153436835.44c00ca39f2ee@www.nexusmail.uwaterloo.ca>
Hi Malcolm,
Thanks for the help. I actually figured this out today the same way you did,
through discussions with the NCBI help desk.
He mentioned the main site is:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/
But specifically:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html
So all you would need to do while using remoteblast is set your $db to one of
the following:
snp/human_9606/human_9606  Human SNPs
snp/human_9606/rs_ch1      Human chr 1 SNPs
snp/human_9606/rs_ch10     Human chr 10 SNPs
snp/human_9606/rs_ch11     Human chr 11 SNPs
snp/human_9606/rs_ch12     Human chr 12 SNPs
snp/human_9606/rs_ch13     Human chr 13 SNPs
snp/human_9606/rs_ch14     Human chr 14 SNPs
snp/human_9606/rs_ch15     Human chr 15 SNPs
snp/human_9606/rs_ch16     Human chr 16 SNPs
snp/human_9606/rs_ch17     Human chr 17 SNPs
snp/human_9606/rs_ch18     Human chr 18 SNPs
snp/human_9606/rs_ch19     Human chr 19 SNPs
snp/human_9606/rs_ch2      Human chr 2 SNPs
snp/human_9606/rs_ch20     Human chr 20 SNPs
snp/human_9606/rs_ch21     Human chr 21 SNPs
snp/human_9606/rs_ch22     Human chr 22 SNPs
snp/human_9606/rs_ch3      Human chr 3 SNPs
snp/human_9606/rs_ch4      Human chr 4 SNPs
snp/human_9606/rs_ch5      Human chr 5 SNPs
snp/human_9606/rs_ch6      Human chr 6 SNPs
snp/human_9606/rs_ch7      Human chr 7 SNPs
snp/human_9606/rs_ch8      Human chr 8 SNPs
snp/human_9606/rs_ch9      Human chr 9 SNPs
snp/human_9606/rs_chMT     Human chr Mitochondrial SNPs
snp/human_9606/rs_chMulti  Human SNPs mapped to multiple locations
snp/human_9606/rs_chNotOn  Human SNPs not mapped
snp/human_9606/rs_chUn     Human SNPs mapped to unplaced contigs
snp/human_9606/rs_chX      Human chr x SNPs
snp/human_9606/rs_chY      Human chr y SNPs
The web site has a more complete list of all other databases available using the
remoteblast module.
Rohan
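
As a sketch of how one of these names would be plugged into the earlier script, the helper below builds a database name from a chromosome label and places it in the same parameter list the original program used. human_snp_db() is a hypothetical convenience function; it covers only the numbered, X, Y and MT entries from the list above, and whether NCBI still accepts these names is not verified here.

```perl
use strict;
use warnings;

# Build an NCBI SNP database name from a chromosome label, following the
# naming pattern in the list above (hypothetical helper).
sub human_snp_db {
    my ($chr) = @_;
    return 'snp/human_9606/human_9606' unless defined $chr;   # all human SNPs
    die "unexpected chromosome label '$chr'\n"
        unless $chr =~ /^(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT)$/;
    return "snp/human_9606/rs_ch$chr";
}

# Same parameter names as the script quoted in this thread.
my @params = (
    -prog   => 'blastn',
    -data   => human_snp_db('X'),
    -expect => 0.01,
);
print "database: $params[3]\n";

# With BioPerl installed, the list would be passed on unchanged:
#   use Bio::Tools::Run::RemoteBlast;
#   my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
```

Only the -data value changes relative to the original script; the submit and retrieve loop stays as it was.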
Quoting "Cook, Malcolm" :
> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
>
> Hmm, looking again, there appear now to be two versions:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last
> updated 2/7/2006)
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html
> (last updated 5/29/2006)
>
> Neither are linked to by any other document on the internet (google sez)
> including anywhere else at NCBI. Go figure. It should be IMHO since
> this info is nowhere else collected.
>
> Of course it may be out of date, but it always has got me through.
>
> Good luck
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA
>
>
>
> >-----Original Message-----
> >From: bioperl-l-bounces at lists.open-bio.org
> >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
> >Sent: Monday, July 17, 2006 4:26 PM
> >To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org
> >Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome
> >
> >Okay, I think I may know what's going on a little more now with NCBI's BLAST
> >interface. Looks like any NCBI BLAST query must use the default URL and so
> >must set up the proper GET/PUT commands to retrieve everything correctly.
> >
> >Here's the API description for it all:
> >
> >http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
> >
> >You could try setting the database to 'snp' or something along those lines
> >instead of 'nr'; or you could see what the name of the database is when you
> >use the web form and try setting it to that. According to this page, this
> >should be possible:
> >
> >http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP_test._Search_dbSNP_Using_B
> >
> >The Entrez Query limit was a recommendation for limiting your search to a
> >set of sequences for human, for instance.
> >
> >I'll try looking into it a bit more but I'm pretty busy. If you find
> >anything out you should probably post it here.
> >
> >Chris
> >
> >> Hi Chris,
> >>
> >> 1. I have tried changing the database to snp or dbSNP but
> >neither works.
> >> It
> >> seems that depending on which type of blast you use(ie, Genome Blast,
> >> Blast SNP,
> >> normal blast such as blastn, etc...) you see a different listing of
> >> databases
> >> available for querys. Since you mention that the Blast page I see was
> >> generated
> >> by Genome, where could I go to see a complete listing of
> >databases I can
> >> query??
> >> Or if you knew off hand which database to search if I only
> >wanted dbSNP
> >> hits?
> >>
> >> 2. You also mention, I can limit the search by using Entrez
> >terms. Do you
> >> mean
> >> like:
> >> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc';
> >> where 'abc' is the name of the subject with which you would
> >only like to
> >> see
> >> result of. For example if you put it as 'Homo
> >sapiens[Organism]' then only
> >> human
> >> sequences would be in hit lists.
> >> If this is what you mean, what would I change it to, to see
> >only hits from
> >> dbSNP?
> >>
> >> Thanks for the ongoing help,
> >>
> >> Rohan
> >>
> >> Quoting Chris Fields :
> >>
> >> > I added a method to RemoteBlast in bioperl-live (CVS) if
> >you want to
> >> play
> >> > with changing the URL. I have been thinking about doing
> >this for a bit
> >> now
> >> > but I already see problems.
> >> >
> >> > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page
> >> (note
> >> > the differences in the URL) but a user-friendly request
> >page, generated
> >> on
> >> > the fly by Genome, to submit BLAST requests for the
> >relevant database.
> >> So
> >> > changing the URL will not work (even by adding extra
> >parameters); you
> >> only
> >> > get the original HTML web page.
> >> >
> >> > You could try changing the database or limiting the search using an
> >> Entrez
> >> > term (which you should be able to include in the request,
> >probably by
> >> adding
> >> > it to the HEADER).
> >> >
> >> > Chris
> >> >
> >> > > -----Original Message-----
> >> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> > > bounces at lists.open-bio.org] On Behalf Of
> >> vrramnar at student.cs.uwaterloo.ca
> >> > > Sent: Thursday, July 13, 2006 5:39 PM
> >> > > To: bioperl-l at lists.open-bio.org
> >> > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome
> >> > >
> >> > >
> >> > > Hello Again,
> >> > >
> >> > > I have another question regarding Remote blast but this
> >time using
> >> Genome
> >> > > Blast.
> >> > >
> >> > > Here is the link:
> >> > >
> >> > >
> >>
> >http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
> >> > >
> >> > > which again uses the main Blast web site:
> >> > >
> >> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
> >> > >
> >> > > Again I am not sure what to add or what HEADER
> >information to change
> >> > > within my
> >> > > script.
> >> > >
> >> > > Here is my program, which was the same as the last email:
> >> > >
> >> > > #!/usr/bin/perl -w
> >> > >
> >> > > use Bio::Perl;
> >> > > use Bio::Tools::Run::RemoteBlast;
> >> > >
> >> > > my $prog = "blastn";
> >> > > my $db = "refseq_genomic";
> >> > > my $e_val = 0.01;
> >> > >
> >> > > my @params = ( '-prog' => $prog,
> >> > > '-data' => $db,
> >> > > '-expect' => $e_val);
> >> > >
> >> > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params);
> >> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'}
> >= '????'; <--
> >> ---
> >> > > what
> >> > > do I put here
> >> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} =
> >'????'; <--- Do I
> >> need
> >> > > to add
> >> > > any other values to the form inputs
> >> > >
> >> > > $factory->submit_blast("blast.in");
> >> > > $v = 1;
> >> > >
> >> > > while (my @rids = $factory->each_rid)
> >> > > { foreach my $rid ( @rids )
> >> > > { my $rc = $factory->retrieve_blast($rid);
> >> > > if( !ref($rc) )
> >> > > { if( $rc < 0 )
> >> > > { $factory->remove_rid($rid);
> >> > > }
> >> > > print STDERR "." if ( $v > 0 );
> >> > > sleep 5;
> >> > > }
> >> > > else
> >> > > { my $result = $rc->next_result();
> >> > > my $filename = $result->query_name()."\.out";
> >> > > $factory->save_output($filename);
> >> > > $factory->remove_rid($rid);
> >> > > print "\nQuery Name: ", $result->query_name(), "\n";
> >> > > }
> >> > > }
> >> > > }
> >> > >
> >> > >
> >> > > Both of my questions are very similiar as in I know how
> >to use remote
> >> > > blast but
> >> > > not sure what to change to access the specific blast I want.
> >> > >
> >> > > Again, any help would be very appreciated!!
> >> > >
> >> > > Rohan
> >> > >
> >> > >
> >> > >
> >> > > ----------------------------------------
> >> > > This mail sent through www.mywaterloo.ca
> >> > > _______________________________________________
> >> > > Bioperl-l mailing list
> >> > > Bioperl-l at lists.open-bio.org
> >> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >> >
> >>
> >>
> >>
> >>
> >> ----------------------------------------
> >> This mail sent through www.mywaterloo.ca
> >
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l at lists.open-bio.org
> >http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>
----------------------------------------
This mail sent through www.mywaterloo.ca
From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:18:27 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:18:27 -0400
Subject: [Bioperl-l] SNP reference file download
Message-ID: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca>
Hello All,
I was wondering if anyone knew how to download an entire SNP reference file from
NCBI?? Or even downloading the sequence data for a particular SNP.
I know how to do this via Bio::DB::GenBank, Bio::DB::SwissProt, etc. when referring
to NM_##### but when I try to access rs###### files I am unsure of what Bio::DB
to point to, if there is one.
For example, if I had the accession number: rs4986950 How could I retrieve NCBI's
entire reference file for this SNP record OR just the SNP sequence relating to
this accession number.
Any help on this subject would greatly be appreciated,
Rohan
From cjfields at uiuc.edu Fri Jul 21 00:51:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 23:51:30 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C00805.7090403@sendu.me.uk>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine>
<44C00805.7090403@sendu.me.uk>
Message-ID: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
> I didn't actually mean a stored file (but that would be possible
> with a
> tied hash or something: DB_File, just like flatfile), but an in-memory
> one for use during the course of program execution. Stored file would
> probably be dangerous because you wouldn't know if the data has become
> stale or not - and checking to see if it wasn't would defeat the
> point.
Okay, that wouldn't be a problem. I currently use in-memory caches
to hold NCBI history information and ELink information for
EUtilities. It would just be a matter of doing the same for
Bio::DB::Taxonomy.
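
A minimal sketch of the kind of in-memory cache meant here, in plain Perl. The function fetch_node_from_source() is a hypothetical stand-in for the slow database or web lookup, and the record layout is invented for illustration; neither comes from Bio::DB::Taxonomy.

```perl
use strict;
use warnings;

my %node_cache;     # lives only for the duration of the program
my $lookups = 0;    # counts how often the slow path is taken

# Hypothetical stand-in for a remote or flatfile lookup.
sub fetch_node_from_source {
    my ($taxid) = @_;
    $lookups++;
    return { taxid => $taxid, name => "taxon $taxid" };
}

# Return the cached record if present, otherwise fetch and remember it.
sub get_node {
    my ($taxid) = @_;
    $node_cache{$taxid} = fetch_node_from_source($taxid)
        unless exists $node_cache{$taxid};
    return $node_cache{$taxid};
}

get_node(9606) for 1 .. 3;    # only the first call reaches the source
print "lookups: $lookups\n";
```

Because the hash disappears when the program exits, the staleness concern raised for stored files does not arise.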
...
> entrez already parses through LineageEx to build the classification
> array. flatfile walks up all the parents to do the same. Having the
> information isn't the issue. We have the information. The methods
> genus() and species() need to work with the genbank fileformat,
> that is
> the problem.
The original purpose for Bio::Species was a simple object to hold
taxonomic information. This object was then used in an attempt to
hold the basic organism information (scientific name, common name,
lineage information, etc) contained in a RichSeq file, like GenBank,
EMBL, SwissProt, etc. The problem: trying to determine which term
in the lineage corresponds to which rank and what part of the
organism's scientific name is the genus, the species, and so on based
solely on the data in the file, which comes down to a best-guess
scenario for many organisms. It does work, but not equally well for
all RichSeq files, not for every organism, and definitely not all the
time. So, yes, genus(), species(), binomial, and other methods are
present, but one must realize that parsing out the data into the
appropriate object data using the various get/sets, with the obvious
exceptions, is not the best way.
Unless... you incorporate information available only outside the
actual file itself (i.e. NCBI Taxonomy information). This is where
Bio::Taxonomy seems to come along, as it's not species-specific (it
can represent any rank) and is also DB-aware. Though Bio::Species
was originally going to delegate all its data to Bio::Taxonomy::Node,
I think the purpose was to eventually replace Bio::Species.
So, my question is, why not use a Bio::Taxonomy::Node-like class
initially to contain the appropriate data for a GenBank file (just
for read/write purposes)? This object, since it implements
Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a
database could also get/set the appropriate object data correctly
using the lineage data. So, for instance, if I called
$species = $seq->species();
and wanted the classification, scientific_name(), common_name, and
other information that is gleaned from the file, then there's no need
for a lookup. Once you cross into the bounds of:
print $species->species();
print $species->genus();
then there's trouble, since we're working straight from the file
(i.e. parsing is mainly correct, but still guesswork and sometimes
wrong). But what if you could do something like this:
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
# normally not needed as this is set by default internally, but as a
# demo here...
$species->db_handle($db);
# reset the appropriate data (genus, species, etc) based on Entrez
# tax data
# (this method, BTW, doesn't exist yet but should be easy to implement)
$species->reset_data();
print $species->species();
my $parent = $species->get_Parent_Node;
my @child = $species->get_Children_Nodes;
...and so on
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From prabubio at gmail.com Fri Jul 21 02:17:41 2006
From: prabubio at gmail.com (Prabu R)
Date: Fri, 21 Jul 2006 11:47:41 +0530
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
References:
<002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
Message-ID:
It works great
Thanks a lot Mr.Chris.
R. Prabu
On 7/20/06, Chris Fields wrote:
>
> Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object.
> You can then use Bio::AlignIO to generate the alignment output if needed,
> or
> use the Bio::SimpleAlign methods to get what you want.
>
> http://www.bioperl.org/wiki/HOWTO:Beginners
>
> http://www.bioperl.org/wiki/HOWTO:SearchIO
>
>
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign.html
>
> Chris
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of Prabu R
> > Sent: Thursday, July 20, 2006 11:02 AM
> > To: bioperl-l at lists.open-bio.org
> > Subject: [Bioperl-l] Blast Output Parsing
> >
> > Dear All!
> >
> > I am now trying to parse a Blast output using PERL.
> >
> > I have to extract each alignment and have to parse the alignment. I
> mean,
> > I
> > have to check whether a particular part of the given sequence got
> aligned
> > 100%.
> >
> > Anybody please tell me what module in PERL I have to use for getting
> this.
> >
> > I've tried Bio::SearchIO. But I didnt get any method to get the
> > alignment.
> >
> > Kindly help.
> >
> > Thanks,
> > R. Prabu
>
>
--
"Every noble work is at first impossible."
- Thomas Carlyle
From mh6 at sanger.ac.uk Fri Jul 21 05:04:57 2006
From: mh6 at sanger.ac.uk (Michael Han)
Date: Fri, 21 Jul 2006 10:04:57 +0100
Subject: [Bioperl-l] PAML parser
Message-ID: <44C098B9.4090003@sanger.ac.uk>
Hi,
I have some questions about the PAML parser (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help.
If you call next_result, $self->_parse_summary might be called, which loops over $self->_readline .
Later in next_result when "while (defined ($_=$self->_readline))" is used isn't the filepointer/filehandle
already at the end of the output file and should return undef breaking the parsing?
I added a crude seek($self->{_filehandle},0,0) after the _parse_summary and it seemed to work, but I wonder if I missed something obvious.
thanks,
Mike
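
The behaviour in question can be reproduced outside the parser with a plain filehandle. The sketch below is generic Perl, not the Bio::Tools::Phylo::PAML code: an in-memory file with made-up contents stands in for the PAML output, and the first loop plays the role of a pass that consumes every line.

```perl
use strict;
use warnings;

# An in-memory "file" standing in for a PAML report (contents are made up).
my $report = "line one\nline two\nline three\n";
open my $fh, '<', \$report or die "cannot open in-memory file: $!";

# First pass: read to end of file, as a summary scan might.
my $first_pass = 0;
while ( defined( my $line = <$fh> ) ) { $first_pass++ }

# A further read at EOF returns undef, so a second loop would not run.
my $at_eof = defined( scalar <$fh> ) ? 0 : 1;

# Rewinding makes the lines available again.
seek( $fh, 0, 0 ) or die "seek failed: $!";
my $second_pass = 0;
while ( defined( my $line = <$fh> ) ) { $second_pass++ }

print "first=$first_pass at_eof=$at_eof second=$second_pass\n";
```

Whether the real parser hits this depends on whether its summary pass actually runs to the end of the input, which this sketch does not establish.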
From cjfields at uiuc.edu Fri Jul 21 08:22:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 07:22:01 -0500
Subject: [Bioperl-l] PAML parser
In-Reply-To: <44C098B9.4090003@sanger.ac.uk>
References: <44C098B9.4090003@sanger.ac.uk>
Message-ID:
Normally when you parse a report you use a loop to iterate through
results:
while (my $result = $parser->next_result) {
# do work here
}
So returning undef is necessary to end the loop. This type of loop
construct is common in BioPerl (and in Perl in general).
There is a HOWTO for PAML:
http://www.bioperl.org/wiki/HOWTO:PAML
Chris
On Jul 21, 2006, at 4:04 AM, Michael Han wrote:
> Hi,
>
> I have some questions about the PAML parser
> (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help.
>
> If you call next_result, $self->_parse_summary might be called,
> which loops over $self->_readline .
>
> Later in next_result when "while (defined ($_=$self->_readline))"
> is used isn't the filepointer/filehandle
> already at the end of the output file and should return undef
> breaking the parsing?
>
> I added a crude seek($self->{_filehandle},0,0) after the
> _parse_summary and it seemed to work, but I wonder if I missed
> something obvious.
>
> thanks,
>
> Mike
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Fri Jul 21 11:50:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 10:50:20 -0500
Subject: [Bioperl-l] SNP reference file download
In-Reply-To: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca>
Message-ID: <000901c6acdd$5f38ddb0$15327e82@pyrimidine>
You'll need the latest code from CVS; you could try (the highly
experimental) Bio::DB::EUtilities to get the raw flatfile XML data, then
pass everything through Bio::ClusterIO. Currently there isn't tempfile,
file, or filehandle support for the EUtilities but I plan on adding this
soon. You could also pipe STDOUT from one SNP retrieval script into STDIN
for the ClusterIO.
BTW, the EFetch object below accepts an array reference of primary IDs if
you want to use them instead, so you don't need to run an ESearch query
first. To do this you'll need to set the database parameter (-db => 'snp');
the database from the ESearch query is passed to EFetch via the Cookie
object.
Chris
use strict;
use warnings;
use Bio::DB::EUtilities;
use Bio::ClusterIO;

# save XML to tempfile for read/write
open my $XMLDATA, '+>', 'tempfile.xml'
    or die "Cannot open tempfile.xml: $!";

# ESearch for term, place data in search history
my $esearch = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -term       => 'dihydroorotase',
                                       -db         => 'snp',
                                       -usehistory => 'y');
$esearch->get_response;
print STDERR "Count: ", $esearch->count, "\n";

# efetch is default EUtility
my $efetch = Bio::DB::EUtilities->new(-cookie  => $esearch->next_cookie,
                                      -rettype => 'flt'); # SNP flatfile
print $XMLDATA $efetch->get_response->content;
seek($XMLDATA, 0, 0); # don't forget to rewind...

my $cio = Bio::ClusterIO->new(-format => 'dbsnp',
                              -fh     => $XMLDATA);
# $snp is a Bio::Variation::snp object, see perldoc for methods
while (my $snp = $cio->next_cluster) {
    print "ID : ", $snp->id, "\n";
}
close $XMLDATA;
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca
> Sent: Thursday, July 20, 2006 6:18 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] SNP reference file download
>
>
> Hello All,
>
> I was wondering if anyone knew how to download an entire SNP reference
> file from
> NCBI?? Or even downloading the sequence data for a particular SNP.
>
> I know how to do this via Bio::DB::GenBank, Bio::DB::SwissProt, etc. when
> referring
> to NM_##### but when I try to access rs###### files I am unsure of what
> Bio::DB
> to point to, if there is one.
>
> For example, if I had the accession number: rs4986950 How could I retrieve
> NCBI's
> entire reference file for this SNP record OR just the SNP sequence
> relating to
> this accession number.
>
> Any help on this subject would greatly be appreciated,
>
> Rohan
>
>
From cjfields at uiuc.edu Sun Jul 23 15:09:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 14:09:48 -0500
Subject: [Bioperl-l] obo_parser.t test warnings
Message-ID:
Hilmar, Sohel,
Didn't know who to notify, so sorry in advance about cross-posting
this to the list. I was running through cleaning up some bugs and
found that obo_parser.t is throwing a ton of warnings:
bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t
1..40
"my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
"my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13.
...
Good news: all tests pass!
Cheers!
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Sun Jul 23 16:53:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 15:53:32 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
Message-ID: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Sendu, Hilmar, et al,
I was looking through SeqIO::genbank and thought I would bring up a
couple of things to think about re: GenBank Taxonomy information.
This is how NCBI defines the names used for SOURCE and ORGANISM
according to the latest GenBank release notes:
SOURCE - Common name of the organism or the name most frequently used
in the literature. Mandatory keyword in all annotated entries/one or
more records/includes one subkeyword.
ORGANISM - Formal scientific name of the organism (first line)
and taxonomic classification levels (second and subsequent lines).
Mandatory subkeyword in all annotated entries/two or more records.
According to their sample file page
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), the SOURCE is this:
Free-format information including an abbreviated form of the organism
name, sometimes followed by a molecule type. (See section 3.4.10 of
the GenBank release notes for more info.)
The SOURCE can also include the organelle and also may include
additional information, such as an abbreviated name and a common name
in parentheses.
...
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...
Setting scientific_name() isn't a problem; acc. to the above
definition, it is the full name on the ORGANISM line. The lineage
(or classification() array) is also straightforward. The common_name(),
though, as used by Bio::SeqIO::genbank, is the entire SOURCE line
(not just the abbreviated name, but the name and everything else).
No additional parsing is performed on it. write_seq() also seems to
do the wrong thing when rebuilding the SOURCE line, as the
method writes the subspecies to the line.
I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try
using Bio::Taxonomy::Node objects instead of Bio::Species, then get
the parsing for these lines corrected and simplified. Essentially,
the way NCBI describes it, the main name on the line is actually the
free-form abbreviated name, the name in parentheses is the common
name (optionally present), and the organelle precedes all of these if
present. I want to try getting common_name() to match the common
name found for taxonomy (baker's yeast) rather than have it be a
simple container, add an abbreviated_name() method for the name
container for the SOURCE line, and have the organelle() method
actually be used if an organelle is present (it doesn't seem to be
set at the moment in SeqIO::genbank).
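
A rough sketch of the SOURCE-line split described above, in plain Perl. The function name parse_source_line, the hash keys, and the short organelle list are assumptions made for illustration; this is not the Bio::SeqIO::genbank code, and real SOURCE lines are free-form enough that a regex like this will miss cases.

```perl
use strict;
use warnings;

# Split a GenBank SOURCE value into organelle, abbreviated name and
# common name. The organelle list is a small illustrative subset.
sub parse_source_line {
    my ($source) = @_;
    my %parsed = (
        organelle        => undef,
        abbreviated_name => undef,
        common_name      => undef,
    );

    # An optional organelle prefix comes first.
    if ( $source =~ s/^(mitochondrion|chloroplast|plastid)\s+//i ) {
        $parsed{organelle} = lc $1;
    }
    # An optional common name sits in trailing parentheses.
    if ( $source =~ s/\s*\(([^()]+)\)\s*$// ) {
        $parsed{common_name} = $1;
    }
    # Whatever remains is the free-form abbreviated name.
    $parsed{abbreviated_name} = $source;
    return \%parsed;
}

my $yeast = parse_source_line("Saccharomyces cerevisiae (baker's yeast)");
print "$yeast->{abbreviated_name} / $yeast->{common_name}\n";
```

Keeping the three pieces in separate fields is what would let common_name() return only "baker's yeast" while the full SOURCE text is still recoverable for writing.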
Right now, I have NO idea how EMBL, DDBJ, other formats deal with
organism info; I would think that the main three (GenBank/EMBL-
SwissProt/DDBJ) handle them similarly...(Famous Last Words)
I also propose (I'll probably get yelled at here) NOT actively
supporting additional parsing of species, subspecies, etc directly
from a file w/o a DB lookup. As in, leave species, subspecies, genus
parsing from the flatfile as is (no longer support it) or remove it
completely and leave them unset.
I haven't looked, but I have a strong feeling that the species
parsing in Bio::SeqIO is different from format to format. It really
seems like more trouble than it's worth to maintain this, especially
as there is perfectly valid Taxonomy database information available
either locally using a flatfile or via Entrez. If people want to
have reliable $species->species or $species->genus for taxonomy
information, they will need to have the db_handle() set for the
Bio::Taxonomy::Node object and have a Node-based method to reset
species, genus, etc to the tax database information (maybe
reset_taxon or something along those lines).
Okay, rambled on enough. Any thoughts?
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From hlapp at gmx.net Sun Jul 23 19:40:45 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 19:40:45 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF86AF.8080408@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
Message-ID:
On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
> I'll describe all the changes I've now made and if no-one complains
> I'll
> commit. (I've also made these notes into bug 2047 for easier reference
> in the future.)
>
> Bio::DB::Taxonomy::flatfile
> ---------------------------
> [...]
>
> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
> division as a three letter code, like 'PRI'. However, for consistency
> with entrez and the scientific_name() of the node the division is
> supposed to correspond to, it is now stored as the full name, like
> 'Primates'.
What about adding a method division_code() which would return the
3-letter abbreviation?
The abbreviation may be needed by flat-file writers, so it may be
handy to have in some cases.
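
One way such an accessor could work is a simple reverse lookup. The sketch below is plain Perl with a deliberately partial table; only the 'PRI'/'Primates' pair is taken from this thread, the other entries and the standalone function form are assumptions for illustration.

```perl
use strict;
use warnings;

# Partial, illustrative table of division codes and full names.
my %division_name_for = (
    PRI => 'Primates',
    ROD => 'Rodents',
    BCT => 'Bacteria',
);
my %division_code_for = reverse %division_name_for;

# Hypothetical accessor: map a stored full name back to its code.
sub division_code {
    my ($full_name) = @_;
    return $division_code_for{$full_name};
}

print division_code('Primates'), "\n";
```

With this approach the node keeps storing the full name, and flat-file writers ask for the code only when they need it.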
>
> The names->id solution also stores the artificially uniqued names like
> 'Craniata <chordata>', allowing you for the first time to retrieve the
> correct id. Previously the search would have simply failed completely.
>
> The names->id solution now handles nodes with scientific names of 'xyz
> (class)', allowing you to retrieve the id with both get_taxonids('xyz')
> and get_taxonids('xyz (class)'). Previously only the latter would
> work.
Should angle brackets be allowed too?
>
> NOTE: the previous 2 changes (and the issues with entrez, see below)
> make flatfile better at searching the taxonomy database than entrez
> module or the website, both in terms of speed and completeness of
> results.
>
> BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
> always being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched)
Maybe there should also be a -names parameter which accepts a hash
reference with keys being the kind of name (scientific, common, etc)
and the values being array references with the set of names of that
kind?
> or the $node->classification() array.
Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
brought over from a flawed (because flat) object model in Bio::Species.
> [...]
>
> Bio::DB::Taxonomy::entrez
> -------------------------
>
> # Bug-fixes
> Special characters like ", ( and ) in the input query string to
> get_taxonid() result in the failure or inaccuracy of the search. These
> characters are now removed prior to submission, allowing for correct
> search results.
> API-CHANGE: entrez has always been able to return multiple ids that
> match a single input name, so I've renamed get_taxonid() to
> get_taxonids() and it returns an array of ids in list context. It
> returns one of the ids in scalar context. For backward compatibility,
> *get_taxonid = \&get_taxonids.
Sounds good to me.
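
For readers unfamiliar with the idiom, the context-sensitive return and the backward-compatible alias can be sketched in plain Perl as follows. The package name and the id table are hypothetical placeholders, not the Bio::DB::Taxonomy::entrez implementation.

```perl
use strict;
use warnings;

package My::TaxonomyDemo;    # hypothetical package, not a BioPerl class

# Toy name-to-id table; the ids are placeholders, not real NCBI taxon ids.
my %ids_for = ( Craniata => [ 1001, 1002 ] );

# List context: every matching id. Scalar context: one of them.
sub get_taxonids {
    my ( $class, $name ) = @_;
    my @ids = @{ $ids_for{$name} || [] };
    return wantarray ? @ids : $ids[0];
}

# Keep the old method name working by aliasing it in the symbol table.
{
    no warnings 'once';
    *get_taxonid = \&get_taxonids;
}

package main;

my @all = My::TaxonomyDemo->get_taxonids('Craniata');
my $one = My::TaxonomyDemo->get_taxonid('Craniata');
print "all: @all; one: $one\n";
```

Existing callers that assign the result of get_taxonid() to a scalar keep getting a single id, while new code can ask for the full list.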
> NOTE: entrez modules (and website) cannot cope with an angle-bracketed
> qualifier in the query, failing searches like 'Craniata <chordata>'.
> For this reason, if get_taxonids() is given a query containing such a
> qualifier it will immediately return undefined, saving a pointless
> website access.
If there is a 'next-best-thing' that is still semantically compatible
with the API documentation, I would do that.
In this case, if there is an angle-bracketed qualifier in the query the
entrez module should strip it and automatically use the rest for searching.
If indeed multiple IDs match there should be a warning to inform the
user that entrez cannot use the angle-bracket notation to limit the
query results.
In fact, you might as well provide an option to enable an automatic
check for the correct branch for each ID if multiple ones are
returned. I.e., if this option is enabled, the module would
automatically query the parent nodes to see if the qualifier is in the
lineage, and if not will remove the respective ID from the result
set. The reason you may want to make it optional is because it
potentially costs time. (but in reality I'm not sure why a client
will not want to enable the option - so maybe this should even be
default)
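
The strip-then-filter idea can be sketched in plain Perl like this. Everything here is illustrative: the function names are made up, the lineage lookup is passed in as a code reference so no database is needed, and the ids and lineages are toy data rather than real NCBI records.

```perl
use strict;
use warnings;

# Split "Name <qualifier>" into its parts; qualifier is undef if absent.
sub split_qualified_name {
    my ($query) = @_;
    if ( $query =~ /^\s*(.*?)\s*<([^<>]+)>\s*$/ ) {
        return ( $1, $2 );
    }
    $query =~ s/^\s+|\s+$//g;
    return ( $query, undef );
}

# Keep only ids whose lineage mentions the qualifier (case-insensitive).
# $lineage_for is a code ref returning the lineage names for an id.
sub filter_by_lineage {
    my ( $qualifier, $lineage_for, @ids ) = @_;
    return @ids unless defined $qualifier;
    return grep {
        my $id = $_;
        scalar grep { lc($_) eq lc($qualifier) } $lineage_for->($id);
    } @ids;
}

# Toy data: two ids share the name, only one sits under Chordata.
my %toy_lineage = (
    1001 => [ 'Metazoa', 'Chordata' ],
    1002 => [ 'Metazoa', 'Brachiopoda' ],
);
my ( $name, $qualifier ) = split_qualified_name('Craniata <chordata>');
my @kept = filter_by_lineage( $qualifier,
    sub { @{ $toy_lineage{ $_[0] } } }, 1001, 1002 );
print "$name -> @kept\n";
```

In a real module the lineage callback would be the part that costs time, since it means extra lookups per candidate id, which is the trade-off behind making the check optional.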
> If you want the id
> of 'Craniata <chordata>' you must search for 'Craniata', then get the
> node for each returned id to see which one has a parent node with a
> scientific_name() or common_names() case-insensitive matching to
> 'chordata'.
Yep, see above. The more burden you can shield from the user the better.
> [...]
> Bio::Taxonomy::Node
> -------------------
> [...]
> classification() has a proper solution to finding the classification
> when the array wasn't manually set.
>
> # Improvements
> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
> it is an alias to name('scientific').
> NOTE: node_name is what is set when ->new(-name => $name) is set, so
> flatfile and entrez and user-created nodes now implicitly associate
> the
> name of the node they create with its scientific name.
I'm not even sure node_name() should just be deprecated. The method
falsely suggests that there is only a single and definitive name for
the taxon node.
In NCBI reality, this is only true for the scientific name of the
node. In real reality, many nodes have multiple scientific names -
taxonomy isn't static and therefore the scientific naming of nodes
isn't either.
> [...]
>
Thanks for the work, all other changes sound great. Thanks also to
Chris for assisting! Looks like this is in much better shape now than
before.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Sun Jul 23 19:44:23 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 19:44:23 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD147A.9020103@sendu.me.uk>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine>
<44BD147A.9020103@sendu.me.uk>
Message-ID: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote:
>
> [regarding changes to Bio::Taxonomy::Node]
>
> Actually, I'm really strongly leaning toward getting rid of the
> following methods and new() options (and giving up entirely on being
> able to keep 'sapiens' somewhere):
>
> -organelle, organelle()
> -division, division()
> -sub_species, sub_species()
> -variant, variant()
> species(), validate_species_name()
> genus()
> binomial()
>
> As far as I can see none of these methods have any place in a generic
> Node class.
I agree. Some of them are a special case for genbank files
(organelle() etc), and the rest is legacy from Bio::Species.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Sun Jul 23 20:48:22 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 20:48:22 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine>
<44C00805.7090403@sendu.me.uk>
<9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
Message-ID: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
On Jul 21, 2006, at 12:51 AM, Chris Fields wrote:
> my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
>
> # normally not needed as this is set by default internally, but as a
> demo here...
> $species->db_handle($db);
>
> # reset the appropriate data (genus, species, etc) based on Entrez
> tax data
> $species->reset_data(); # this method, BTW, doesn't exist yet but
> should be easy to implement
Don't call this reset_data() as it may be misleading (usually reset()
means to revert into a native or original state). Instead, you would
use fetch_from_db() or something.
However, it seems redundant to me to begin with. If we ignore for a
second that the return value in the following isn't exactly
compatible, why would you not just call
$species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);
So I think more than anything else, this should be made to work, and
you would have a more seamless interface.
> Short and sweet summary:
>
> Sendu volunteered making changes to Bio::Taxonomy::Node and related
> modules;
> we disagreed on exactly what changes should be made. Sendu wanted a
> stripped-down version of Bio::Taxonomy::Node; I wanted one which would
> support similar methods as in Bio::Species.
Bio::Species should be considered legacy; I think it is flawed as an
object model because it imposes a flat view on something which in
reality is only a node in a tree and not flat at all.
The only real need for the flat view came from the desire to write
sequence files - for all other purposes the classification() etc
attributes are useless anyway.
I.e., binomial() and common_name() (corresponding to scientific_name()
and names('common')) are the only really useful attributes, the rest
is baggage for writing sequence files. The baggage should not be
passed on to a better model ...
Instead, there should be a separate module (in essence a Bio::Species
factory) which can translate a Bio::Taxonomy::Node into a
Bio::Species object - if the rank is 'species' or below.
Alternatively, you could have a Bio::Taxonomy::SpeciesNode object
which implements both APIs and can be initialized with either a
Bio::Taxonomy::Node instance, or the combination of a Bio::Species
and a db handle.
At any rate, I think Bio::Taxonomy::Node should be stripped of legacy
methods that are only there to achieve Bio::Species compatibility.
>
> I suggested have a common interface module, one for Node and
> another for
> Species; both implement the same interface methods (NodeI maybe),
> so you
> could use either a bare-bones Node or a full-fledged Species
> object. I then
> suggested this new version of Species could replace Bio::Species.
> We could
> worry about which one to use for Bio::DB::Taxonomy* later.
I'm not following here... What would this look like? What would the
API(s) be?
>
> We both agreed. Everybody's happy.
Happiness is great, so maybe you shouldn't bother about me not
following...
> I still plan on switching Bio::DB::Taxonomy::entrez to use
> Bio::DB::EUtilities at some point
Wouldn't that rather be Bio::DB::Taxonomy::eutil?
> I may add a method for retrieving tax data based on protein/nucleotide
> sequence primary ID and relevant sequence database, so you could
> directly retrieve the relevant TaxID w/o parsing sequences directly for
> them. This would mainly be useful if you gather GIs from a BLAST
> search, for instance.
>
> Anyway, I could add this in the base class Bio::DB::Taxonomy directly
> so one could use the retrieved TaxIDs for flat-file or entrez searches;
> this requires, of course, access to the remote Entrez database (it
> would use ELink). Would that be of interest?
If you add the API methods for this to the base class (which in this
case is close in concept to an interface, much like Bio/SeqIO.pm),
then make clear that not every database will allow you to implement
this.
>
>          |------Node
> NodeI----|
>          |------Species
>
> Another option would be to have Bio::Taxonomy::Node itself stripped
> down, then have another class (Bio::Taxonomy::Species) inherit methods
> from it and also implement additional methods (genus(), species(), etc).
I think this would be the way to go. I.e.,
|------Node
NodeI----|
|-|
|----SpeciesNode
Species----|
This way the NodeI interface and its direct implementors are kept
free of legacy.
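
One possible reading of the diagram above, as a minimal sketch. The Sketch::* package names are hypothetical stand-ins (not real BioPerl modules), and having SpeciesNode inherit from Node is an assumption about how the legacy methods would be layered on top:

```perl
use strict;
use warnings;

# Hypothetical interface: declares what every node must offer.
package Sketch::NodeI;
sub scientific_name { die "scientific_name() not implemented\n" }
sub rank            { die "rank() not implemented\n" }

# Bare-bones implementation, free of Bio::Species-style legacy methods.
package Sketch::Node;
our @ISA = ('Sketch::NodeI');
sub new {
    my ($class, %args) = @_;
    return bless { name => $args{-name}, rank => $args{-rank} }, $class;
}
sub scientific_name { $_[0]{name} }
sub rank            { $_[0]{rank} }

# Only this class carries the legacy species-level accessors.
package Sketch::SpeciesNode;
our @ISA = ('Sketch::Node');
sub binomial { $_[0]->scientific_name }
sub genus    { (split ' ', $_[0]->scientific_name)[0] }
sub species  { (split ' ', $_[0]->scientific_name)[1] }

package main;

my $node = Sketch::SpeciesNode->new(-name => 'Homo sapiens', -rank => 'species');
print join(' | ', $node->genus, $node->species, $node->rank), "\n";
```

The point of the layering is that code written against the plain node never sees genus() or species(); only callers that explicitly ask for the species-level class do.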
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Sun Jul 23 20:43:45 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 19:43:45 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine>
<44BD147A.9020103@sendu.me.uk>
<6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
Message-ID: <5F6027E0-A504-4019-8DAB-C50DF9EB6E18@uiuc.edu>
As an aside, the 'source' seqfeature in a GenBank file contains some
of the following information as tags; that's where the NCBI tax ID is
taken from in Bio::SeqIO::genbank:
FEATURES Location/Qualifiers
source 1..814
/organism="Porterinema fluviatile"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/strain="SAG 124.79"
/db_xref="taxon:246123"
/country="Germany"
...
So, variant(), organelle(), and ncbi_taxid() could all be set from
the same point in Bio::SeqIO::genbank.
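
As a rough illustration of pulling these values out of the qualifiers, here is a self-contained sketch. It works on plain strings rather than on BioPerl feature objects, and parse_source_qualifiers() is a hypothetical helper, not part of Bio::SeqIO::genbank:

```perl
use strict;
use warnings;

# Qualifier lines as they appear in the 'source' feature quoted above.
my @qualifiers = (
    '/organism="Porterinema fluviatile"',
    '/organelle="plastid:chloroplast"',
    '/mol_type="genomic DNA"',
    '/strain="SAG 124.79"',
    '/db_xref="taxon:246123"',
    '/country="Germany"',
);

# Turn /tag="value" lines into a tag => value hash.
# Assumption: each tag occurs once; real records can repeat db_xref.
sub parse_source_qualifiers {
    my @lines = @_;
    my %tags;
    for my $line (@lines) {
        next unless $line =~ m{^/(\w+)="(.*)"$};
        $tags{$1} = $2;
    }
    return %tags;
}

my %tags = parse_source_qualifiers(@qualifiers);

# The NCBI taxon ID is the number after "taxon:" in db_xref.
my ($taxid) = ($tags{db_xref} // '') =~ /^taxon:(\d+)$/;

print "taxid=$taxid organelle=$tags{organelle} strain=$tags{strain}\n";
```

Inside a parser that already has a feature object, the same values would presumably come from the feature's tag accessors instead of a regex over raw lines.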
I suggested an option to Sendu, but I'd like to hear your thoughts on
this since this will possibly affect bioperl-db. We could have two
Node-like Taxonomy objects using a common interface class
(Bio::Taxonomy::NodeI) : Bio::Taxonomy::Node (stripped down version),
and Bio::Taxonomy::Species (the sequence-based NodeI-implementing
object, which would retain the other Bio::Species-like methods).
Bio::Taxonomy::Species would act sort of as an 'entry point' for
Bio::Taxonomy from sequences; moving up or down the tax node
hierarchy gets Tax::Node objects, unless you are specifically at a
'species'-ranked node (though this could be just a Tax::Node as well).
BTW, I have managed to get Bio::SeqIO::genbank switched over to
Bio::Taxonomy::Node (er... Bio::Taxonomy::Species); all tests pass.
I was quite surprised how easy it was. It shouldn't be too hard to
get a NodeI/Node/Species class hierarchy set up if everybody thinks
it's worth it. Then we could deprecate Bio::Species!
Chris
On Jul 23, 2006, at 6:44 PM, Hilmar Lapp wrote:
>
> On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote:
>
>>
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
>>
>> As far as I can see none of these methods have any place in a generic
>> Node class.
>
> I agree. Some of them are a special case for genbank files
> (organelle() etc), and the rest is legacy from Bio::Species.
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From hlapp at gmx.net Sun Jul 23 20:58:32 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 20:58:32 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID:
On Jul 23, 2006, at 4:53 PM, Chris Fields wrote:
> I also propose (I'll probably get yelled at here) NOT actively
> supporting additional parsing of species, subspecies, etc directly
> from a file w/o a DB lookup. As in, leave species, subspecies, genus
> parsing from the flatfile as is (no longer support it) or remove it
> completely and leave them unset.
Note that most (as in: most used, not most taxa) cases are actually
straightforward. I don't think removing what's there is desirable,
just everyone needs to understand that it will recognize only a
limited number of syntactical variations, and beyond that if you want
correct taxon attributes you will need a database (be it flatfile, eutil,
whatever) lookup.
> If people want to
> have reliable $species->species or $species->genus for taxonomy
> information, they will need to have the db_handle() set for the
> Bio::Taxonomy::Node object and have a Node-based method to reset
> species, genus, etc to the tax database information (maybe
> reset_taxon or something along those lines).
That's what I've been saying all along.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Sun Jul 23 23:30:07 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 22:30:07 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To:
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID: <28D3470B-DA8F-4C41-96C7-F0D0DE89BAEE@uiuc.edu>
On Jul 23, 2006, at 7:58 PM, Hilmar Lapp wrote:
>
> On Jul 23, 2006, at 4:53 PM, Chris Fields wrote:
>
>> I also propose (I'll probably get yelled at here) NOT actively
>> supporting additional parsing of species, subspecies, etc directly
>> from a file w/o a DB lookup. As in, leave species, subspecies, genus
>> parsing from the flatfile as is (no longer support it) or remove it
>> completely and leave them unset.
>
> Note that most (as in: most used, not most taxa) cases are actually
> straightforward. I don't think removing what's there is desirable,
> just everyone needs to understand that it will recognize only a
> limited number of syntactical variations, and beyond that if you
> want correct taxon attributes you will need a database (be it flatfile,
> eutil, whatever) lookup.
Aha! We seem to agree on that...
>> If people want to
>> have reliable $species->species or $species->genus for taxonomy
>> information, they will need to have the db_handle() set for the
>> Bio::Taxonomy::Node object and have a Node-based method to reset
>> species, genus, etc to the tax database information (maybe
>> reset_taxon or something along those lines).
>
> That's what I've been saying all along.
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
I thought you had mentioned something about this a few months back on
EMBL format issues with organism data. Anyway, I don't think it was
from anybody disagreeing with you as much as it was one of the
project priorities that sort of got lost in the shuffle. I'm sure
Sendu will like having a bit of freedom with Bio::Taxonomy::Node.
Anyway, I'll do what I can within reason; I have to leave next
weekend for a 5-day conference.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Mon Jul 24 04:21:55 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 09:21:55 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
<21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
Message-ID: <44C48323.5060704@sendu.me.uk>
Hilmar Lapp wrote:
> On Jul 21, 2006, at 12:51 AM, Chris Fields wrote:
>
>> my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
>>
>> # normally not needed as this is set by default internally, but as a
>> # demo here...
>> $species->db_handle($db);
>>
>> # reset the appropriate data (genus, species, etc) based on Entrez
>> # tax data; this method, BTW, doesn't exist yet but should be easy
>> # to implement
>> $species->reset_data();
>
> Don't call this reset_data() as it may be misleading (usually reset()
> means to revert into a native or original state). Instead, you would
> use fetch_from_db() or something.
>
> However, it seems redundant to me to begin with. If we ignore for a
> second that the return value in the following isn't exactly
> compatible, why would you not just call
>
> $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);
If Bio::Species was a Bio::Taxonomy, and we had FactoryI implementing
classes or similar, we would say:
$species = $factory->fetch(-taxon_id => $species->ncbi_taxid);
> Instead, there should be a separate module (in essence a Bio::Species
> factory) which can translate a Bio::Taxonomy::Node into a
> Bio::Species object - if the rank is 'species' or below.
I don't think a 'translation' module is necessary. Bio::Species can just
be a Bio::Taxonomy.
> At any rate, I think Bio::Taxonomy::Node should be stripped of legacy
> methods that are only there to achieve Bio::Species compatibility.
Yes :)
> I think this would be the way to go. I.e.,
>
>
> |------Node
> NodeI----|
> |-|
> |----SpeciesNode
> Species----|
Actually, if we're changing the name of the module that Species
interacts with, any existing code needs to be re-written. So why not
just do it properly and have Bio::Species interact with Bio::Taxonomy?
|----Bio::Taxonomy
Bio::TaxonomyI----|
|----Bio::Species
Or
Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species
Leaving Node completely free to be just a node. This way we don't have a
crufty SpeciesNode there simply for the sake of Bio::Species.
Bio::Species itself provides all the legacy stuff it needs for itself,
while interacting with Nodes via TaxonomyI methods in the 'correct' way
only.
From bix at sendu.me.uk Mon Jul 24 03:58:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 08:58:57 +0100
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID: <44C47DC1.8020503@sendu.me.uk>
Chris Fields wrote:
> Sendu, Hilmar, et al,
>
> I was looking through SeqIO::genbank and thought I would bring up a
> couple of things to think about re: GenBank Taxonomy information.
[...]
> SOURCE - Common name of the organism or the name most frequently used
> in the literature. Mandatory keyword in all annotated entries/one or
> more records/includes one subkeyword.
[...]
> Free-format information including an abbreviated form of the organism
> name, sometimes followed by a molecule type. (See section 3.4.10 of
> the GenBank release notes for more info.)
>
> The SOURCE can also include the organelle and also may include
> additional information, such as an abbreviated name and a common name
> in parentheses.
More specifically:
(from 3.4.10 ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
The SOURCE field consists of two parts. The first part is found after
the SOURCE keyword and contains free-format information including an
abbreviated form of the organism name followed by a molecule type;
multiple lines are allowed, but the last line must end with a period.
The second part consists of information found after the ORGANISM
subkeyword. The formal scientific name for the source organism (genus
and species, where appropriate) is found on the same line as ORGANISM.
The records following the ORGANISM line list the taxonomic
classification levels, separated by semicolons and ending with a
period.
> The common_name(), though, as used by Bio::SeqIO::genbank, is the
> entire SOURCE line (not just the abbreviated name, but the name and
> everything else). No additional parsing is performed on it.
> write_seq() also seems to do the wrong thing when rebuilding the
> SOURCE line, as the method writes the subspecies to the line.
>
> I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try
> using Bio::Taxonomy::Node objects instead of Bio::Species, then get
> the parsing for these lines corrected and simplified. Essentially,
> the way NCBI describes it, the main name on the line is actually the
> free-form abbreviated name, the name in parentheses is the common
> name (optionally present), and the organelle precedes all of these if
> present. I want to try getting common_name() to match the common
> name found for taxonomy (baker's yeast) rather than have it be a
> simple container, add an abbreviated_name() method for the name
> container for the SOURCE line, and have the organelle() method
> actually be used if an organelle is present (it doesn't seem to be
> set at the moment in SeqIO::genbank).
This is not how I read the specification. Everything on the same
line as 'Source' is free-format text and therefore cannot be parsed. For
the purposes of writing out it must be stored as-is, but it serves no
other useful purpose. The file also provides the scientific name which
can be used to do an accurate database lookup, which in turn gives you
access to the common names, like "baker's yeast".
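
To make that reading concrete, here is a minimal sketch that keeps the SOURCE text verbatim and takes the scientific name and classification from the ORGANISM part. The record fragment is illustrative (not taken from this thread), parse_source_block() is a hypothetical helper, and the sketch assumes the organism name fits on one line:

```perl
use strict;
use warnings;

# Illustrative record fragment; the lineage is an example only.
my $record = <<'END';
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces.
END

sub parse_source_block {
    my ($text) = @_;
    my (%out, $field);
    my $lineage = '';
    for my $line (split /\n/, $text) {
        if ($line =~ /^SOURCE\s+(.*)$/) {
            $out{source} = $1;              # free-format: stored as-is
            $field = 'source';
        }
        elsif ($line =~ /^\s+ORGANISM\s+(.*)$/) {
            $out{scientific_name} = $1;     # formal scientific name
            $field = 'organism';
        }
        elsif ($line =~ /^\s+(\S.*)$/ && defined $field) {
            if ($field eq 'source') { $out{source} .= " $1" }   # SOURCE may wrap
            else                    { $lineage .= "$1 " }       # classification lines
        }
    }
    $lineage =~ s/\.\s*$//;                 # drop the terminating period
    $out{classification} = [ grep { length } split /\s*;\s*/, $lineage ];
    return \%out;
}

my $parsed = parse_source_block($record);
print "source:          $parsed->{source}\n";
print "scientific name: $parsed->{scientific_name}\n";
print "lineage depth:   ", scalar @{ $parsed->{classification} }, "\n";
```

Nothing here tries to pick a common name out of the SOURCE text; under this reading that would come from a database lookup keyed on the scientific name.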
On a side note, why would we care about 'organelle' when we're dealing
with taxonomy? Why does the NCBI taxonomy db have a slot for organelle?
From bix at sendu.me.uk Mon Jul 24 04:45:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 09:45:38 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
Message-ID: <44C488B2.5070806@sendu.me.uk>
Hilmar Lapp wrote:
> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
>
>> Bio::DB::Taxonomy::flatfile
>> ---------------------------
>> [...]
>>
>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
>> division as a three letter code, like 'PRI'. However, for consistency
>> with entrez and the scientific_name() of the node the division is
>> supposed to correspond to, it is now stored as the full name, like
>> 'Primates'.
>
> What about adding a method division_code() which would return the 3-
> letter abbreviation?
>
> The abbreviation may be needed by flat-file writers, so it may be
> handy to have in some cases.
As far as I know you can't get the 3-letter version via entrez, so no
other module can really expect to be able to get it, not knowing which
database (flatfile.pm or entrez.pm) the taxonomic information is coming from.
But of course it would be somewhat harmless to add division_code()
anyway. It might be better done as a -code => 1 option to division()?
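
A minimal sketch of what a -code => 1 option could look like. The division() function below is hypothetical (it is not the current Bio::Taxonomy::Node method), the node is modelled as a plain hash, and the code table is deliberately incomplete:

```perl
use strict;
use warnings;

# A few GenBank division codes and their full names.
my %DIVISION_NAME = (
    PRI => 'Primates',
    ROD => 'Rodents',
    MAM => 'Mammals',
    VRT => 'Vertebrates',
    INV => 'Invertebrates',
    BCT => 'Bacteria',
);
my %DIVISION_CODE = reverse %DIVISION_NAME;

# Hypothetical accessor: returns the full name by default and the
# three-letter code when called with -code => 1.
sub division {
    my ($node, %opts) = @_;
    my $name = $node->{division};
    return $name unless $opts{-code};
    return $DIVISION_CODE{$name};    # undef when no code is known
}

my $node = { division => 'Primates' };
print division($node), "\n";
print division($node, -code => 1), "\n";
```

Returning undef for an unknown name keeps the accessor honest when the node came from a source that cannot supply a code.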
>> The names->id solution also stores the artificially uniqued names like
>> 'Craniata ', allowing you for the first time to retrieve the
>> correct id. Previously the search would have simply failed completely.
>>
>> The names->id solution now handles nodes with scientific names of
>> 'xyz (class)', allowing you to retrieve the id with both
>> get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only
>> the latter would work.
>
> Should angle brackets be allowed too?
Allowed in what sense? You can indeed search for both
get_taxonids('Craniata ') [returns a single id] and
get_taxonids('Craniata') [returns multiple ids, one of which is the
previous answer].
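
The lookup behaviour can be sketched with a plain hash index. This is only an illustration for the parenthesised 'xyz (class)' form; the names and IDs are made up, and get_taxonids_sketch() is a hypothetical stand-in for the real get_taxonids():

```perl
use strict;
use warnings;

# Hypothetical name/ID pairs; the IDs are invented for illustration.
my @names = (
    [ 'xyz (class)', 101 ],
    [ 'xyz (genus)', 202 ],
    [ 'abc',         303 ],
);

# Index each name as given and, when it ends in a parenthesised
# qualifier, also under the bare name.
my %name_to_ids;
for my $pair (@names) {
    my ($name, $id) = @$pair;
    push @{ $name_to_ids{$name} }, $id;
    if ((my $bare = $name) =~ s/\s*\([^)]*\)\s*$//) {
        push @{ $name_to_ids{$bare} }, $id;
    }
}

sub get_taxonids_sketch {
    my ($query) = @_;
    return @{ $name_to_ids{$query} || [] };
}

print join(',', get_taxonids_sketch('xyz (class)')), "\n";   # the qualified name: one id
print join(',', get_taxonids_sketch('xyz')), "\n";           # the bare name: every match
```

The qualified query narrows to one ID while the bare query returns every node sharing that name, which is the asymmetry described above.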
> Maybe there should also be a -names parameter which accepts a hash
> reference with keys being the kind of name (scientific, common, etc)
> and the values being array references with the set of names of that
> kind?
Not sure what you mean. name() has that data structure, though you're
not supposed to set its hash ref directly.
>> or the $node->classification() array.
>
> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
> brought over from a flawed (because flat) object model in Bio::Species.
Yes, I agree.
>> NOTE: entrez modules (and website) cannot cope with ''
>> in the
>> query, failing searches like 'Craniata '. For this
>> reason, if
>> get_taxonids() is given a query with '' it will immediately
>> return undefined, saving a pointless website access.
>
> If there is a 'next-best-thing' that is still semantically compatible
> with the API documentation, I would do that.
>
> In this case, if there is a in the query the entrez
> module should strip it and automatically use the rest for searching.
> If indeed multiple IDs match there should be a warning to inform the
> user that entrez cannot use the notation to limit the
> query results.
I wouldn't like this. I actually had it working this way initially, but
decided that if someone entered 'xyz ' they really didn't
want multiple ids, expected to get multiple ids with just 'xyz' and
don't want their query made something else and then be warned about it.
> In fact, you might as well provide an option to enable an automatic
> check for the correct branch for each ID if multiple ones are
> returned. I.e., if this option is enabled, the module would
> automatically query the parent nodes to see if is in the
> lineage, and if not will remove the respective ID from the result
> set. The reason you may want to make it optional is because it
> potentially costs time. (but in reality I'm not sure why a client
> will not want to enable the option - so maybe this should even be
> default)
I can certainly add that, it seems like a good idea. I don't, however,
see any scope for an option at all. What would the option be called?
-don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless,
imho. If the user queries 'xyz ' with that option, they're
just going to have to do for themselves manually what the method would
have done for them without that option, in order to get the correct
answer. It'll be slower that way, if anything. So the option would
actually be called
-don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower
(!).
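
A minimal sketch of the lineage check, assuming the qualifier from the query names some ancestor of the intended node. The tiny taxonomy, its IDs and both helper functions are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical miniature taxonomy: id => [ name, parent id ].
my %node = (
    1 => [ 'root',   undef ],
    2 => [ 'groupA', 1 ],
    3 => [ 'groupB', 1 ],
    4 => [ 'xyz',    2 ],
    5 => [ 'xyz',    3 ],
);

# True if any ancestor of $id carries the name $qualifier.
sub has_ancestor_named {
    my ($id, $qualifier) = @_;
    my $parent = $node{$id}[1];
    while (defined $parent) {
        return 1 if lc($node{$parent}[0]) eq lc($qualifier);
        $parent = $node{$parent}[1];
    }
    return 0;
}

# Keep only the candidate IDs whose lineage contains the qualifier.
sub filter_by_lineage {
    my ($qualifier, @ids) = @_;
    return grep { has_ancestor_named($_, $qualifier) } @ids;
}

my @kept = filter_by_lineage('groupB', 4, 5);
print "@kept\n";
```

With a remote database each step up the tree is another request, which is where the time cost mentioned in the quoted text would come from.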
>> Bio::Taxonomy::Node
>> -------------------
>> [...]
>> classification() has a proper solution to finding the classification
>> when the array wasn't manually set.
>>
>> # Improvements
>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common').
>> Now it is an alias to name('scientific').
>> NOTE: node_name is what is set when ->new(-name => $name) is set, so
>> flatfile and entrez and user-created nodes now implicitly associate
>> the name of the node they create with its scientific name.
>
> I'm not even sure node_name() should just be deprecated. The method
> falsely suggests that there is only a single and definitive name for
> the taxon node.
>
> In NCBI reality, this is only true for the scientific name of the
> node. In real reality, many nodes have multiple scientific names -
> taxonomy isn't static and therefore the scientific naming of nodes
> isn't either.
For the programmer not using any database but just making up his own
nodes, I think he needs a node_name() because he may not be thinking
about anything fancy or realistic. He just wants to give his node a
single name that he invents. node_name() seems like the ideal method
name to me.
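
For the simple use case described here, the aliasing could look roughly like this. Sketch::NamedNode is a hypothetical stand-in and its name() signature is an assumption, not the real Bio::Taxonomy::Node API:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node.
package Sketch::NamedNode;

sub new {
    my ($class, %args) = @_;
    my $self = bless { names => {} }, $class;
    $self->node_name($args{-name}) if defined $args{-name};
    return $self;
}

# name($kind) returns the names of that kind; name($kind, @values) sets them.
sub name {
    my ($self, $kind, @values) = @_;
    $self->{names}{$kind} = [@values] if @values;
    return @{ $self->{names}{$kind} || [] };
}

# node_name() is a thin alias for the first scientific name.
sub node_name {
    my ($self, @value) = @_;
    my @names = $self->name('scientific', @value);
    return $names[0];
}

package main;

my $node = Sketch::NamedNode->new(-name => 'Homo sapiens');
$node->name('common', 'human', 'man');
print $node->node_name, "\n";
print join(', ', $node->name('common')), "\n";
```

Someone inventing nodes by hand only ever touches -name and node_name(), while database-backed code can still record several names per kind through name().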
From jaynelvallance at hotmail.com Mon Jul 24 05:45:50 2006
From: jaynelvallance at hotmail.com (Jayne Vallance)
Date: Mon, 24 Jul 2006 09:45:50 +0000
Subject: [Bioperl-l] SearchIO - Stop throwing away data
Message-ID:
Hi
I am developing someone else's work. I wondered whether anyone could
identify the mistake that the previous coder made?
I am not very familiar with SearchIO yet.
They are trying to extract filenames from an output report.
This is their code:
# store the query name of the mito db blast hits into an array
my $searchio = new Bio::SearchIO( -file => $blast_mito_output );
# array to store the mitochondrial BLAST database hits
my @mito_hits;
# name of query for BLAST hit
my $query_name;
while ( my $result = $searchio->next_result() ) {
    # get the hits and their associated name
    # do not want to include these in the clustering step
    while( my $hit = $result->next_hit ) {
        # store the names of these hits into an array
        # these filenames will not be copied over
        $query_name = $result->query_name();
        #print "\nQuery $query_name\n";
        push(@mito_hits, $query_name);
    }
}
I think they have based it on the code at
http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
use Bio::SearchIO;
use Bio::SearchIO::FastHitEventBuilder;
my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
$searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
while( my $r = $searchio->next_result ) {
    while( my $h = $r->next_hit ) {
        # Hits will NOT have HSPs
        print $h->significance,"\n";
    }
}
which "throws away data you don't want"???
I am finding that our code is finding the last file name in the output
report, rather than each and every one. I suspect it is overwriting (or
throwing away) the data.
How do I need to change the code to make sure *every* file name goes
into @mito_hits?
Thank you
Jayne
From simon.andrews at bbsrc.ac.uk Mon Jul 24 07:14:08 2006
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 24 Jul 2006 12:14:08 +0100
Subject: [Bioperl-l] SearchIO - Stop throwing away data
In-Reply-To:
Message-ID:
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> Jayne Vallance
> Sent: 24 July 2006 10:46
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] SearchIO - Stop throwing away data
>
> Hi
>
> I am developing someone else's work. I wondered whether anyone could
> identify the mistake that the previous coder made?
> I am not very familiar with SearchIO yet.
>
> They are trying to extract filenames from an output report.
I'm not sure what you mean by filenames here. The value which is being
collected in your code snippet is the name of the original query
sequence.
> This is their code:
> while ( my $result = $searchio->next_result() ) {
> # get the hits and their associated name
> # do not want to include these in the clustering step
> while( my $hit = $result->next_hit ) {
> # store the names of these hits into an array
> # these filenames will not be copied over
> $query_name = $result->query_name();
> #print "\nQuery $query_name\n";
> push(@mito_hits, $query_name);
OK, this bit is odd. You're collecting the name of the query sequence
but you're doing it as you're looping through the hits. Since all the
hits come from the same result you're just going to get the same query
name put into your array multiple times (once for each hit). This
almost certainly isn't what you want.
If you just want the name of the query sequence you can miss out the
inner loop (the $result->next_hit() loop). If you actually want to
collect the names of the sequences which were hit then you need to
collect $hit->name() rather than $result->query_name();
> }
> }
>
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) {
>     while( my $h = $r->next_hit ) {
>         # Hits will NOT have HSPs
>         print $h->significance,"\n";
>     }
> }
>
> which "throws away data you don't want"???
Indeed, but probably not in the way you're thinking. The data it throws
away is the details of each individual HSP (mostly the alignment data).
You're not using HSP data in your code so it will have no effect (other
than making it a bit quicker). It doesn't throw away whole hits or
anything like that.
> I am finding that our code is finding the last file name in
> the output report, rather than each and every one. I suspect
> it is overwriting (or throwing away the data).
I suspect then that you should be collecting the hit names rather than
the query names?
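
The two alternatives could be sketched as below. Both functions are hypothetical helpers that accept any object offering next_result(), so in real use you would pass them a Bio::SearchIO object; the calls they make (next_result, next_hit, query_name, name) are the same ones used in the code quoted above:

```perl
use strict;
use warnings;

# Alternative 1: one entry per query; no inner hit loop is needed.
sub collect_query_names {
    my ($searchio) = @_;
    my @query_names;
    while (my $result = $searchio->next_result) {
        push @query_names, $result->query_name;
    }
    return @query_names;
}

# Alternative 2: one entry per hit, taken from the hit itself.
sub collect_hit_names {
    my ($searchio) = @_;
    my @hit_names;
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            push @hit_names, $hit->name;
        }
    }
    return @hit_names;
}

# Intended use (requires BioPerl and a report file, so left as a comment):
#   use Bio::SearchIO;
#   my $searchio  = Bio::SearchIO->new(-format => 'blast',
#                                      -file   => $blast_mito_output);
#   my @mito_hits = collect_hit_names($searchio);
```

A report stream can only be read through once, so pick one of the two functions per Bio::SearchIO object, or open the file again for the second pass.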
Simon.
From hlapp at gmx.net Mon Jul 24 08:20:00 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 08:20:00 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <44C47DC1.8020503@sendu.me.uk>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
<44C47DC1.8020503@sendu.me.uk>
Message-ID: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net>
On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote:
> On a side note, why would we care about 'organelle' when we're dealing
> with taxonomy? Why does the NCBI taxonomy db have a slot for
> organelle?
Because some sequences are of the organelle DNA, and Genbank needs a
way to express this. Highly artificial, but still can't be ignored.
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Mon Jul 24 08:27:28 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 08:27:28 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C488B2.5070806@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
<44C488B2.5070806@sendu.me.uk>
Message-ID: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net>
:-) I think we're largely in agreement. As for node_name() I fully
understand the motivation, but it needs to be understood that the
attribute's value will be based on a largely arbitrary choice unless
it is set directly by the user.
-hilmar
On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
>>
>>> Bio::DB::Taxonomy::flatfile
>>> ---------------------------
>>> [...]
>>>
>>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it
>>> makes the
>>> division as a three letter code, like 'PRI'. However, for
>>> consistency
>>> with entrez and the scientific_name() of the node the division is
>>> supposed to correspond to, it is now stored as the full name, like
>>> 'Primates'.
>>
>> What about adding a method division_code() which would return the 3-
>> letter abbreviation?
>>
>> The abbreviation may be needed by flat-file writers, so it may be
>> handy to have in some cases.
>
> As far as I know you can't get the 3-letter version via entrez, so no
> other module can really expect to be able to get it, not knowing which
> database (flatfile.pm or entrez.pm) the taxonomic information is
> coming from.
>
> But of course it would be somewhat harmless to add division_code()
> anyway. It might be better done as a -code => 1 option to division()?
>
>
>>> The names->id solution also stores the artificially uniqued names
>>> like
>>> 'Craniata ', allowing you for the first time to
>>> retrieve the
>>> correct id. Previously the search would have simply failed
>>> completely.
>>>
>>> The names->id solution now handles nodes with scientific names of
>>> 'xyz (class)', allowing you to retrieve the id with both
>>> get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only
>>> the latter would work.
>>
>> Should angle brackets be allowed too?
>
> Allowed in what sense? You can indeed search for both
> get_taxonids('Craniata ') [returns a single id] and
> get_taxonids('Craniata') [returns multiple ids, one of which is the
> previous answer].
>
>
>> Maybe there should also be a -names parameter which accepts a hash
>> reference with keys being the kind of name (scientific, common, etc)
>> and the values being array references with the set of names of that
>> kind?
>
> Not sure what you mean. name() has that data structure, though you're
> not supposed to set its hash ref directly.
>
>
>>> or the $node->classification() array.
>>
>> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
>> brought over from a flawed (because flat) object model in
>> Bio::Species.
>
> Yes, I agree.
>
>
>>> NOTE: entrez modules (and website) cannot cope with ''
>>> in the
>>> query, failing searches like 'Craniata '. For this
>>> reason, if
>>> get_taxonids() is given a query with '' it will
>>> immediately
>>> return undefined, saving a pointless website access.
>>
>> If there is a 'next-best-thing' that is still semantically compatible
>> with the API documentation, I would do that.
>>
>> In this case, if there is a in the query the entrez
>> module should strip it and automatically use the rest for searching.
>> If indeed multiple IDs match there should be a warning to inform the
>> user that entrez cannot use the notation to limit the
>> query results.
>
> I wouldn't like this. I actually had it working this way initially,
> but
> decided that if someone entered 'xyz ' they really didn't
> want multiple ids, expected to get multiple ids with just 'xyz' and
> don't want their query made something else and then be warned about
> it.
>
>
>> In fact, you might as well provide an option to enable an automatic
>> check for the correct branch for each ID if multiple ones are
>> returned. I.e., if this option is enabled, the module would
>> automatically query the parent nodes to see if is in the
>> lineage, and if not will remove the respective ID from the result
>> set. The reason you may want to make it optional is because it
>> potentially costs time. (but in reality I'm not sure why a client
>> will not want to enable the option - so maybe this should even be
>> default)
>
> I can certainly add that, it seems like a good idea. I don't, however,
> see any scope for an option at all. What would the option be called?
> -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless,
> imho. If the user queries 'xyz ' with that option, they're
> just going to have to do for themselves manually what the method would
> have done for them without that option, in order to get the correct
> answer. It'll be slower that way, if anything. So the option would
> actually be called
> -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower
> (!).
>
>
>>> Bio::Taxonomy::Node
>>> -------------------
>>> [...]
>>> classification() has a proper solution to finding the classification
>>> when the array wasn't manually set.
>>>
>>> # Improvements
>>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common').
>>> Now it is an alias to name('scientific').
>>> NOTE: node_name is what is set when ->new(-name => $name) is set, so
>>> flatfile and entrez and user-created nodes now implicitly associate
>>> the name of the node they create with its scientific name.
>>
>> I'm not even sure node_name() should just be deprecated. The method
>> falsely suggests that there is only a single and definitive name for
>> the taxon node.
>>
>> In NCBI reality, this is only true for the scientific name of the
>> node. In real reality, many nodes have multiple scientific names -
>> taxonomy isn't static and therefore the scientific naming of nodes
>> isn't either.
>
> For the programmer not using any database but just making up his own
> nodes, I think he needs a node_name() because he may not be thinking
> about anything fancy or realistic. He just wants to give his node a
> single name that he invents. node_name() seems like the ideal method
> name to me.
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Mon Jul 24 08:31:44 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 08:31:44 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C48323.5060704@sendu.me.uk>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
<21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
<44C48323.5060704@sendu.me.uk>
Message-ID: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
Sounds good to me, except there is no Bio::TaxonomyI yet, and also
Bio::Species shouldn't fully depend on an internet connection or flat
file to do anything meaningful.
I.e., it should take advantage of a lookup database if there is one,
but in the absence of that one should also be able to statically set
attribute values to whatever one thinks can be gleaned from a parsed
text or whatever.
-hilmar
On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote:
>> I think this would be the way to go. I.e.,
>>
>>
>> |------Node
>> NodeI----|
>> |-|
>> |----SpeciesNode
>> Species----|
>
> Actually, if we're changing the name of the module that Species
> interacts with, any existing code needs to be re-written. So why not
> just do it properly and have Bio::Species interact with Bio::Taxonomy?
>
>                   |----Bio::Taxonomy
> Bio::TaxonomyI----|
>                   |----Bio::Species
>
> Or
>
> Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species
>
> Leaving Node completely free to be just a node. This way we don't
> have a
> crufty SpeciesNode there simply for the sake of Bio::Species.
> Bio::Species itself provides all the legacy stuff it needs for itself,
> while interacting with Nodes via TaxonomyI methods in the 'correct'
> way
> only.
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From bix at sendu.me.uk Mon Jul 24 08:34:45 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 13:34:45 +0100
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
<44C47DC1.8020503@sendu.me.uk>
<27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net>
Message-ID: <44C4BE65.8080304@sendu.me.uk>
Hilmar Lapp wrote:
>
> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote:
>
>> On a side note, why would we care about 'organelle' when we're dealing
>> with taxonomy? Why does the NCBI taxonomy db have a slot for organelle?
>
> Because some sequences are of the organelle DNA, and Genbank needs a way
> to express this. Highly artificial, but still can't be ignored.
Ok, but why is it stored as part of the taxonomy? Why isn't it stored in
its own field? And does /bioperl/ have to store it as part of the
taxonomy? Maybe the file parser could have its own organelle() method
and leave all taxonomic classes without such a method. Or it could stay
as is, I don't know.
Do different organelles in the same species get unique taxonomy ids?
From hlapp at gmx.net Mon Jul 24 08:46:51 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 08:46:51 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <44C4BE65.8080304@sendu.me.uk>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
<44C47DC1.8020503@sendu.me.uk>
<27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net>
<44C4BE65.8080304@sendu.me.uk>
Message-ID: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net>
On Jul 24, 2006, at 8:34 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>>
>> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote:
>>
>>> On a side note, why would we care about 'organelle' when we're
>>> dealing
>>> with taxonomy? Why does the NCBI taxonomy db have a slot for
>>> organelle?
>> Because some sequences are of the organelle DNA, and Genbank needs
>> a way
>> to express this. Highly artificial, but still can't be ignored.
>
> Ok, but why is it stored as part of the taxonomy? Why isn't it
> stored in
> its own field? And does /bioperl/ have to store it as part of the
> taxonomy?
No, but clients need to be able to obtain it. Organelles have their
own genome. If we talk about the human genome, for instance, most
commonly we refer to the nuclear genome only.
> Maybe the file parser could have its own organelle() method
> and leave all taxonomic classes without such a method. Or it could
> stay
> as is, I don't know.
Like I said above, at the end of the day there needs to be a way to
qualify a sequence by the genome it is part of.
>
> Do different organelles in the same species get unique taxonomy ids?
I would have to confirm, but I believe so. As I said, from a genome/
sequence-centric viewpoint, the organelle and nuclear genomes are two
different things.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From simon.andrews at bbsrc.ac.uk Mon Jul 24 09:34:10 2006
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 24 Jul 2006 14:34:10 +0100
Subject: [Bioperl-l] New EMBL format parsing/writing
Message-ID:
A few weeks ago I saw a couple of messages on this list mentioning the
new ID/SV line format used in the latest EMBL release. I'm in the
process of moving our database server over to the new format and was
looking to update SeqIO::embl.pm.
I'm sure someone said they'd made a patch to fix up parsing of the new
format, but I can't find it either in CVS or bugzilla.
Rather than do this again myself can someone point me to an updated
SeqIO::embl.pm please? If there isn't one then I'll look into making
the patch myself.
Since this is such a major change are there any plans to put out a new
release with this fix included? I'm sure this will start to bite more
people as the new format becomes more widely adopted.
Cheers
Simon.
--
Simon Andrews PhD
Bioinformatics Group
The Babraham Institute
simon.andrews at bbsrc.ac.uk
+44 (0) 1223 496463
From cjfields at uiuc.edu Mon Jul 24 09:44:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 08:44:37 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
<21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
<44C48323.5060704@sendu.me.uk>
<8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
Message-ID: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>
Hence the reason to have it be a hybrid of Bio::Species and
Tax::Node. Bio::SeqIO::genbank works very happily with the current
Bio::Taxonomy::Node now; if we intend to remove most of the methods we
need to have a similar DB-aware module to house the flatfile data
(like Bio::Species) yet be capable of working with Bio::Taxonomy
(like Tax::Node).
As for organelle(), that could be made into something else
(Bio::Annotation::SimpleValue or similar) but as it's always been
included with the tax data, that's where it has been. The TaxID in
the 'source' seqfeature doesn't refer to the organelle but the organism.
Chris
On Jul 24, 2006, at 7:31 AM, Hilmar Lapp wrote:
> Sounds good to me, except there is no Bio::TaxonomyI yet, and also
> Bio::Species shouldn't fully depend on an internet connection or flat
> file to do anything meaningful.
>
> I.e., it should take advantage of a lookup database if there is one,
> but in the absence of that one should also be able to statically set
> attribute values to whatever one thinks can be gleaned from a parsed
> text or whatever.
>
> -hilmar
>
> On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote:
>
>>> I think this would be the way to go. I.e.,
>>>
>>>
>>> |------Node
>>> NodeI----|
>>> |-|
>>> |----SpeciesNode
>>> Species----|
>>
>> Actually, if we're changing the name of the module that Species
>> interacts with, any existing code needs to be re-written. So why not
>> just do it properly and have Bio::Species interact with
>> Bio::Taxonomy?
>>
>>                   |----Bio::Taxonomy
>> Bio::TaxonomyI----|
>>                   |----Bio::Species
>>
>> Or
>>
>> Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species
>>
>> Leaving Node completely free to be just a node. This way we don't
>> have a
>> crufty SpeciesNode there simply for the sake of Bio::Species.
>> Bio::Species itself provides all the legacy stuff it needs for
>> itself,
>> while interacting with Nodes via TaxonomyI methods in the 'correct'
>> way
>> only.
>>
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Mon Jul 24 09:49:42 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 14:49:42 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
<21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
<44C48323.5060704@sendu.me.uk>
<8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
Message-ID: <44C4CFF6.40609@sendu.me.uk>
Hilmar Lapp wrote:
> Sounds good to me, except there is no Bio::TaxonomyI yet,
Indeed, I propose making one.
> Bio::Species shouldn't fully depend on an internet connection or flat
> file to do anything meaningful.
>
> I.e., it should take advantage of a lookup database if there is one, but
> in the absence of that one should also be able to statically set
> attribute values to whatever one thinks can be gleaned from a parsed
> text or whatever.
Yes, which is why Bio::Taxonomy is appropriate here. Assuming that
Bio::Species isa Bio::TaxonomyI:
...
SOURCE Saccharomyces cerevisiae (baker's yeast)
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Saccharomycotina;
Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...
## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification =>
    ['Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');
# (would probably want to come up with a more generic name for the
# fetch() and generate() methods, so that all Factories use the
# same method name)
It's very clean and flexible this way. Ultimately you always make your
Bio::Species the same way - you add nodes to it. You can make those
nodes yourself or use a factory.
We also solve Chris' earlier quandary:
[ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode
exist, and given that Bio::DB::Taxonomy* currently directly make Node
objects ]
> The only problem I can foresee is which class to use with
> Bio::DB::Taxonomy*? I guess one could settle on one class by default and
> have the option to use another Bio::Taxonomy::NodeI-implementing class if
> you wanted more data/methods available...
The way to do it is to have the Bio::DB::Taxonomy* modules return only
the information that a Bio::Taxonomy::FactoryI would need to make a
NodeI. The specific Factory that you use could generate whatever type of
Node you wanted.
But actually I propose there is only one Node and the specific Factory
that you use determines the kind of Bio::TaxonomyI made; GenbankFactory
might make a Bio::Species, while EntrezFactory might make a Bio::Taxonomy.
Bio::Species differs from Bio::Taxonomy only so it contains all the
legacy methods names that Bio::Species currently has, for backward
compatibility. Setting $species->classification() would delete all nodes
of self, use a GenbankFactory to make a new Bio::Species, then pull out
all its Nodes and add them to self.
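
As a rough illustration of the "you always add nodes" idea above, here is a minimal core-Perl sketch. It does not use BioPerl at all: the package name Sketch::Taxonomy, the plain-hash nodes, and the id/parent_id keys are all hypothetical stand-ins for the proposed Bio::Species/Bio::TaxonomyI classes, and it assumes the nodes form a single unbranched lineage.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed design; none of this is BioPerl API.
package Sketch::Taxonomy;

sub new { my ($class) = @_; return bless { nodes => {} }, $class }

# Store a node: a plain hash with id, parent_id, name and an optional rank.
sub add_node {
    my ($self, $node) = @_;
    $self->{nodes}{ $node->{id} } = $node;
    return $self;
}

# With arguments: delete all existing nodes and rebuild them from a list of
# names ordered from most specific to least specific.
# Always returns the lineage found by walking parent links from the leaf.
sub classification {
    my ($self, @names) = @_;
    if (@names) {
        $self->{nodes} = {};
        for my $i (0 .. $#names) {
            $self->add_node({
                id        => $i + 1,
                parent_id => ($i < $#names ? $i + 2 : undef),
                name      => $names[$i],
            });
        }
    }
    my @all = values %{ $self->{nodes} };
    # The leaf is the node that no other node names as its parent.
    my %is_parent = map { ( $_->{parent_id} => 1 ) }
                    grep { defined $_->{parent_id} } @all;
    my ($leaf) = grep { !$is_parent{ $_->{id} } } @all;

    my @lineage;
    my $n = $leaf;
    while ($n) {
        push @lineage, $n->{name};
        $n = defined $n->{parent_id} ? $self->{nodes}{ $n->{parent_id} } : undef;
    }
    return @lineage;
}

package main;

# The fully-manual way: build each node and add it.
my $manual = Sketch::Taxonomy->new;
$manual->add_node({ id => 1, parent_id => 2, name => 'Saccharomyces cerevisiae',
                    rank => 'species' });
$manual->add_node({ id => 2, parent_id => 3, name => 'Saccharomyces' });
$manual->add_node({ id => 3, parent_id => undef, name => 'Saccharomycetaceae' });
print join('; ', $manual->classification), "\n";

# The setter does the manual steps for you, much as a factory would.
my $generated = Sketch::Taxonomy->new;
my @lineage = $generated->classification(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae');
print join('; ', @lineage), "\n";
```

Both routes end with the same set of nodes, which is the point of the proposal: a factory only saves typing, it does not produce a different kind of object.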
Unless anyone can think of a better way of doing things, I'll explore
the above ideas and start writing code. To summarise: major changes to
Bio::DB::Taxonomy* (make them factory slaves), implementation of some
Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make
Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI.
Oh, Bio::Taxonomy might need some changes as well. It has a classify()
method that does something with a Bio::Species, which would be all wrong
in the new way of doing things.
From bix at sendu.me.uk Mon Jul 24 09:53:23 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 14:53:23 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
<21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
<44C48323.5060704@sendu.me.uk>
<8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
<5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>
Message-ID: <44C4D0D3.1020506@sendu.me.uk>
Chris Fields wrote:
> Bio::SeqIO::genbank works very happily with the current
> Bio::Taxonomy::Node now; if we intend to remove most of the methods we
> need to have a similar DB-aware module to house the flatfile data (like
> Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node).
Can you give code examples of what Bio::SeqIO::genbank is doing and what
makes it 'happy'? What are the requirements? Would it be as happy
working with a Bio::Taxonomy object?
From aramsey at vecna.com Mon Jul 24 10:23:46 2006
From: aramsey at vecna.com (Al Ramsey)
Date: Mon, 24 Jul 2006 10:23:46 -0400
Subject: [Bioperl-l] Making BioPerl Faster
Message-ID: <44C4D7F2.6020107@vecna.com>
I'm interested in following up on a suggestion from the bioperl.org
site about making it faster
(http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I
wanted to look a little more into how the object instantiations might be
more efficient. Is anyone else looking into this actively now? I want
to ask if anyone had any additional insights that weren't previously
published before I started.
Thank you,
Al Ramsey
--
Alvin Ramsey, PhD.
Vecna Technologies, Inc.
5205 Leesburg Pike
Falls Church, VA 22041
aramsey at vecna.com
t: 703.998.5333
f: 703.998.5816
From s-merchant at northwestern.edu Mon Jul 24 11:09:49 2006
From: s-merchant at northwestern.edu (Sohel Merchant)
Date: Mon, 24 Jul 2006 10:09:49 -0500
Subject: [Bioperl-l] obo_parser.t test warnings
In-Reply-To:
Message-ID: <004301c6af33$3564a8e0$c2987ca5@pc13>
Hey Chris,
I usually run perl with all warnings disabled. So I never saw these. I
will put a fix to them sometime this week.
Thanks,
Sohel.
_____
From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Sunday, July 23, 2006 2:10 PM
To: bioperl-l List; Hilmar Lapp; s-merchant at northwestern.edu
Subject: obo_parser.t test warnings
Hilmar, Sohel,
Didn't know who to notify, so sorry in advance about cross-posting this to
the list. I was running through cleaning up some bugs and found that
obo_parser.t is throwing a ton of warnings:
bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w
t/obo_parser.t
1..40
"my" variable $val masks earlier declaration in same scope at
Bio/OntologyIO/obo.pm line 592.
"my" variable $qh masks earlier declaration in same scope at
Bio/OntologyIO/obo.pm line 592.
Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239,
line 13.
...
Good news: all tests pass!
Cheers!
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From prabubio at gmail.com Mon Jul 24 11:39:43 2006
From: prabubio at gmail.com (Prabu R)
Date: Mon, 24 Jul 2006 21:09:43 +0530
Subject: [Bioperl-l] Remote Blast Execution
Message-ID:
Dear All!
I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast.
I am not able to get the blast result.
To my knowledge, the Bio::SearchIO::blast hash object does not return any
result.
Secondly, I tried 'remote_blast.pl', a program from the CPAN bioperl 1.5 release.
Command:
perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i
/home/prabucn/Blast/mm_test1.fa
Error Message:
retrieving blasts..
-------------------- WARNING ---------------------
MSG: Possible error (1) while parsing BLAST report!
---------------------------------------------------
Please help.
Thanks,
R. Prabu.
Please look into my test program.
----------------------------------------------------------------------------------------------
use Bio::Tools::Run::RemoteBlast;
use strict;
use Bio::SeqIO;
use Bio::SearchIO;

my $prog = 'blastn';
my $db = 'est';
my $e_val= '1e-10';

my @params = ( '-prog' => $prog,
               '-data' => $db,
               '-expect' => $e_val,
               '-readmethod' => 'SearchIO' );

my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant do";
my $v = 1;
my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' );

while (my $input = $str->next_seq()){
    my $r = $factory->submit_blast($input);
    print STDERR "waiting..." if( $v > 0 );
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            } else {
                print "$rc\n";
                my $result = $rc->next_result();
                my $filename = $result->query_name()."\.out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    next unless ( $v > 0);
                    print "\thit name is ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "\t\tscore is ", $hsp->score, "\n";
                    }
                }
            }
        }
    }
}
----------------------------------------------------------------------------------------------
--
"Every noble work is at first impossible."
- Thomas Carlyle
From cjfields at uiuc.edu Mon Jul 24 11:48:45 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 10:48:45 -0500
Subject: [Bioperl-l] SearchIO - Stop throwing away data
In-Reply-To:
Message-ID: <001701c6af38$a81c1580$15327e82@pyrimidine>
> Hi
>
> I am developing someone
> else's work. I wondered whether anyone could identify the
> mistake that the previous coder made?
> I am not very familiar with SearchIO yet.
>
> They are trying to extract filenames from an output report.
> This is their code:
>
> # store the query name of the mito db blast hits into an array
> my $searchio = new Bio::SearchIO( -file => $blast_mito_output );
> # array to store the mitochondrial BLAST database hits
> my @mito_hits;
> # name of query for BLAST hit
> my $query_name;
>
Just as a gripe here: you should always designate the '-format' here to be
'blast' for BLAST text output.
my $searchio = new Bio::SearchIO(-file => $blast_mito_output,
-format => 'blast' );
The default is still text, so the above works, but that very well may change
in the future.
Each BLAST report is a Result. Each Result contains one or more hits; each
hit contains one or more HSPs. SearchIO only parses the information
contained in the BLAST report (i.e. no filenames). From here, it looks like
you want Hit information, though. The code below copies the query_name from
the BlastResult object, $result (i.e. the name of your query sequence, the
one you submitted for BLAST'ing against a database). You need the BlastHit
data from $hit.
Change :
$query_name = $result->query_name();
#print "\nQuery $query_name\n";
push(@mito_hits, $query_name);
To :
$hit_name = $hit->description();
#print "\nHit $hit_name\n";
push(@mito_hits, $hit_name);
or, for the hit accession, use
$hit_name = $hit->accession();
For all accessions in the description (there may be multiples if sequences
are identical), use an array and
@hit_name = $hit->get_all_accessions();
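
To make the Result/Hit distinction concrete without needing a BLAST report, here is a small sketch that mocks the nesting with plain hashes. Nothing in it is Bio::SearchIO API; the key names (query_name, hits, name, hsps) and the values are made up purely for illustration.

```perl
use strict;
use warnings;

# Mock of the Result -> Hit -> HSP nesting using plain hashes, not
# Bio::SearchIO objects. All names and values are invented.
my @results = (
    {
        query_name => 'query_1',
        hits       => [
            { name => 'hit_A', hsps => [ { score => 120 }, { score => 80 } ] },
            { name => 'hit_B', hsps => [ { score => 95 } ] },
        ],
    },
);

my (@query_names, @mito_hits);
for my $result (@results) {
    for my $hit ( @{ $result->{hits} } ) {
        # What the original loop did: the same query name once per hit.
        push @query_names, $result->{query_name};
        # What was wanted: one entry per hit, taken from the hit itself.
        push @mito_hits, $hit->{name};
    }
}

print "query names: @query_names\n";
print "hit names:   @mito_hits\n";
```

The inner loop runs once per hit either way; only the object the name is read from changes.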
You can use a different EventHandler if you want to speed things up:
my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
$searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
But to have this work you need to update to the latest CVS version of
bioperl; this was a recent bug that was fixed.
Chris
> while ( my $result = $searchio->next_result() ) {
> # get the hits and their associated name
> # do not want to include these in the clustering step
> while( my $hit = $result->next_hit ) {
> # store the names of these hits into an array
> # these filenames will not be copied over
> $query_name = $result->query_name();
> #print "\nQuery $query_name\n";
> push(@mito_hits, $query_name);
> }
> }
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
>
> use Bio::SearchIO;
> use Bio::SearchIO::FastHitEventBuilder;
> my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
>
> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) {
> while( my $h = $r->next_hit ) {
> # Hits will NOT have HSPs
> print $h->significance,"\n";
> }
>
> which "throws away data you don't want"???
>
> I am finding that our code is finding the last file name in the output
> report,
> rather than each and every one. I suspect it is overwriting (or throwing
> away the data).
>
> How do I need to change the code to make sure *every* file name goes
> into @mito_hits?
>
> Thank you
>
> Jayne
From dwaner at scitegic.com Mon Jul 24 12:03:21 2006
From: dwaner at scitegic.com (dwaner at scitegic.com)
Date: Mon, 24 Jul 2006 09:03:21 -0700
Subject: [Bioperl-l] New EMBL format parsing/writing
Message-ID:
Simon,
I have already updated SeqIO::embl.pm to support release 87. All I have
left to do is generate the patch and update the /t test. I will try to
get this submitted to bugzilla today (24 July).
- David
From cjfields at uiuc.edu Mon Jul 24 12:04:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:04:40 -0500
Subject: [Bioperl-l] Making BioPerl Faster
In-Reply-To: <44C4D7F2.6020107@vecna.com>
Message-ID: <001901c6af3a$df146ea0$15327e82@pyrimidine>
Give it a look, sure! Not sure if this is the only problem though when it
comes to speed; I think it's more complicated than that. I think that (at
least on WinXP) the Perl version used is also partially to blame. It's
possible that something modified between v 5.6 and 5.8 slowed everything
down considerably. I always wondered if it had something to do with Unicode
support in perl 5.8 ...
There is a report on Bugzilla about a dramatic slowdown on sequence parsing
between v. 1.4 and v. 1.5 (including the latest, v 1.5.1)
http://bugzilla.open-bio.org/show_bug.cgi?id=1875
This is unresolved at this time but may be unrelated to the possible perl
versioning issue above.
I've a feeling you may find regexes and redundant method calls also add
quite a bit of overhead. I've seen several places where accessors are
called over and over w/o assigning to a local variable. Or places where a
tr/// would work much faster than a s///. There was an instance of the
latter in SeqIO; switching it to tr/// sped up parsing about 2-3x on WinXP.
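
For anyone who wants to measure the tr/// versus s/// difference on their own machine, here is a minimal sketch using the core Benchmark module. The test string and the two helper subs are made up for illustration and are not taken from SeqIO; the tr/// character list only covers space, tab, CR and LF, which is an assumption about what needs stripping.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A made-up sequence string with embedded whitespace.
my $seq = "ACGT \nNNAC GT\n" x 1000;

# Two ways to strip whitespace; both work on a copy of the input.
sub strip_with_s  { my $s = shift; $s =~ s/\s//g;       return $s }
sub strip_with_tr { my $s = shift; $s =~ tr/ \t\r\n//d; return $s }

# Run each for about one CPU second and print a comparison table.
cmpthese(-1, {
    's///'  => sub { strip_with_s($seq) },
    'tr///' => sub { strip_with_tr($seq) },
});
```

The same harness works for the accessor case: time a loop that calls the accessor on every iteration against one that copies the value into a lexical first.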
If you want to look at the impact of object instantiation on speed, check
out Bio::SearchIO (parsing of BLAST/FASTA/HMMER reports). Lots of method
calls, object creation, etc.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Al Ramsey
> Sent: Monday, July 24, 2006 9:24 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Making BioPerl Faster
>
> I'm interested in following up on a suggestion from the bioperl.org
> site about making it faster
> (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I
> wanted to look a little more into how the object instantiations might be
> more efficient. Is anyone else looking into this actively now? I want
> to ask if anyone had any additional insights that weren't previously
> published before I started.
>
> Thank you,
> Al Ramsey
>
>
> --
> Alvin Ramsey, PhD.
>
> Vecna Technologies, Inc.
> 5205 Leesburg Pike
> Falls Church, VA 22041
> aramsey at vecna.com
> t: 703.998.5333
> f: 703.998.5816
>
From cjfields at uiuc.edu Mon Jul 24 12:06:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:06:03 -0500
Subject: [Bioperl-l] Remote Blast Execution
In-Reply-To:
Message-ID: <001a01c6af3b$10187f50$15327e82@pyrimidine>
You need to update to the latest code (bioperl-live) from CVS. BLAST
parsing using RemoteBlast is broken in all the latest releases.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Prabu R
> Sent: Monday, July 24, 2006 10:40 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Remote Blast Execution
>
> Dear All!
>
> I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast.
>
> I am not able to get the blast result.
> To my knowledge, the Bio::SearchIO::blast hash object does not return any
> result.
>
>
> Secondly, I tried 'remote_blast.pl', a program from the CPAN bioperl
> 1.5 release.
>
> Command:
> perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i
> /home/prabucn/Blast/mm_test1.fa
>
> Error Message:
>
> retrieving blasts..
>
> -------------------- WARNING ---------------------
> MSG: Possible error (1) while parsing BLAST report!
> ---------------------------------------------------
>
> Please help.
>
> Thanks,
> R. Prabu.
>
>
> Please look into my test program.
> --------------------------------------------------------------------------
> --------------------
> use Bio::Tools::Run::RemoteBlast;
> use strict;
> use Bio::SeqIO;
> use Bio::SearchIO;
>
> my $prog = 'blastn';
> my $db = 'est';
> my $e_val= '1e-10';
>
> my @params = ( '-prog' => $prog,
> '-data' => $db,
> '-expect' => $e_val,
> '-readmethod' => 'SearchIO' );
>
> my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant
> do";
>
> my $v = 1;
>
> my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta'
> );
>
> while (my $input = $str->next_seq()){
> my $r = $factory->submit_blast($input);
>
> print STDERR "waiting..." if( $v > 0 );
> while ( my @rids = $factory->each_rid ) {
> foreach my $rid ( @rids ) {
> my $rc = $factory->retrieve_blast($rid);
>
> if( !ref($rc) ) {
> if( $rc < 0 ) {
> $factory->remove_rid($rid);
> }
> print STDERR "." if ( $v > 0 );
> sleep 5;
> } else {
> print "$rc\n";
> my $result = $rc->next_result();
> my $filename = $result->query_name()."\.out";
> $factory->save_output($filename);
> $factory->remove_rid($rid);
> print "\nQuery Name: ", $result->query_name(), "\n";
> while ( my $hit = $result->next_hit ) {
> next unless ( $v > 0);
> print "\thit name is ", $hit->name, "\n";
> while( my $hsp = $hit->next_hsp ) {
> print "\t\tscore is ", $hsp->score, "\n";
> }
> }
> }
> }
> }
> }
> --------------------------------------------------------------------------
> --------------------
>
> --
> "Every noble work is at first impossible."
> - Thomas Carlyle
From cjfields at uiuc.edu Mon Jul 24 12:21:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:21:39 -0500
Subject: [Bioperl-l] New EMBL format parsing/writing
In-Reply-To:
Message-ID: <001c01c6af3d$3df2dc70$15327e82@pyrimidine>
The only proposed EMBL changes I can remember were for Tax data (organism
lines). It shouldn't be hard to change the way these are parsed.
We could leave parsing of SV for older files and run a check on the ID line
format to accommodate old and new sequences, though I have no problem with
only supporting the latest formats. Continual support for old deprecated
sequence formats leads to lots of cruft over time; SwissPort parsing has the
same issue. You would be surprised how many people out there never bother
to update their sequences and use old data...
I believe you are referring to this (from the latest EMBL release notes):
...
2 CHANGES IN THIS RELEASE
2.1 Changes to the Feature Table Document: Chapter 3.5 "Location"
The use of range (.) descriptor within location spans is no longer legal.
2.2 ID line changes
ID line structure underwent the following changes
* All tokens are separated by a semicolon.
* The entry name is not displayed, in its place there is the primary
accession number.
* The sequence version is indicated.
* The topology is a separate token and is indicated for both circular
and linear molecules.
* Both the data class and taxonomic divisions will be displayed.
This is an example of the new ID line:
ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
     (1)       (2)   (3)     (4)          (5)  (6)  (7)
The tokens represent:
1. Primary accession number.
2. 'SV' + sequence version number.
3. Topology: 'circular' or 'linear'.
4. Molecule type.
5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS,
STD, "normal" entries will have STD for standard).
6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV,
SYN, UNC, VRL, PHG).
7. Sequence length + 'BP.'.
The entry name is no longer displayed in the ID line.
A mapping file (entryname to accession number)
ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/entryname_to_acc.mapping is
provided for those entries where the entryname is not the same as the
accession number.
The SV line has been dropped as sequence version information is now
displayed in the ID line.
In order to facilitate the changeover to the new ID line structure, two
small utilities have been released: 'new2oldID.pl' and 'old2newID.pl'. They
can be used to convert EMBL flat files from the old to the new format and
vice-versa. The converters can be found at
ftp://ftp.ebi.ac.uk/pub/databases/embl/tools
A new version of the Syncron tools (for maintaining synchronised copies of
EMBL database updates) that became the working version with EMBL release 87
can be found in the same directory. In this version the tools were adjusted
to cope with the new format of the ID line in EMBL entries and some related
changes.
...
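
As a rough idea of what a check on the ID line format could look like, here is a minimal core-Perl sketch that splits a new-style ID line into the seven tokens listed above and returns nothing for anything else, so a caller could fall back to older parsing. The sub name parse_new_embl_id and the hash keys are hypothetical and are not taken from SeqIO::embl.pm; the old-style sample line is recalled from earlier releases and should be treated as illustrative only.

```perl
use strict;
use warnings;

# Parse a new-style (release 87) ID line into its seven tokens.
# Returns a hash reference, or nothing if the line is not in the new format.
sub parse_new_embl_id {
    my ($line) = @_;
    return unless $line =~ /^ID\s+(.+?)\s*$/;
    my @tok = split /\s*;\s*/, $1;
    return unless @tok == 7;
    my ($version) = $tok[1] =~ /^SV\s+(\d+)$/   or return;
    my ($length)  = $tok[6] =~ /^(\d+)\s+BP\.$/ or return;
    return {
        accession  => $tok[0],
        version    => $version,
        topology   => $tok[2],
        molecule   => $tok[3],
        data_class => $tok[4],
        division   => $tok[5],
        length     => $length,
    };
}

my $new = parse_new_embl_id(
    'ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.');
printf "%s.%d %s %s %d bp\n",
    @{$new}{qw(accession version topology molecule length)};

# An older-style line (illustrative layout) is simply not recognised.
my $old = parse_new_embl_id('ID   AB000263   standard; RNA; PRI; 368 BP.');
print "old-style line: ", (defined $old ? "parsed" : "not new format"), "\n";
```

Counting tokens after splitting on semicolons keeps the test for "new format" cheap, which matters if it runs once per entry in a large flat file.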
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of simon andrews (BI)
> Sent: Monday, July 24, 2006 8:34 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] New EMBL format parsing/writing
>
> A few weeks ago I saw a couple of messages on this list mentioning the
> new ID/SV line format used in the latest EMBL release. I'm in the
> process of moving our database server over to the new format and was
> looking to update SeqIO::embl.pm.
>
> I'm sure someone said they'd made a patch to fix up parsing of the new
> format, but I can't find it either in CVS or bugzilla.
>
> Rather than do this again myself can someone point me to an updated
> SeqIO::embl.pm please? If there isn't one then I'll look into making
> the patch myself.
>
> Since this is such a major change are there any plans to put out a new
> release with this fix included? I'm sure this will start to bite more
> people as the new format becomes more widely adopted.
>
>
> Cheers
>
> Simon.
>
> --
> Simon Andrews PhD
> Bioinformatics Group
> The Babraham Institute
>
> simon.andrews at bbsrc.ac.uk
> +44 (0) 1223 496463
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
From cjfields at uiuc.edu Mon Jul 24 12:37:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:37:32 -0500
Subject: [Bioperl-l] New EMBL format parsing/writing
In-Reply-To:
Message-ID: <002001c6af3f$76214490$15327e82@pyrimidine>
Great work! Does it support old and new EMBL or only the newest? I don't
have a problem with dumping old format support, but if we do we need to note
this in POD and elsewhere (wiki, perhaps).
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com
> Sent: Monday, July 24, 2006 11:03 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] New EMBL format parsing/writing
>
> Simon,
>
> I have already updated SeqIO::embl.pm to support release 87. All I have
> left to do is generate the patch and update the /t test. I will try to
> get this submitted to bugzilla today (24 July).
>
> - David
From cjfields at uiuc.edu Mon Jul 24 14:40:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 13:40:03 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C4D0D3.1020506@sendu.me.uk>
Message-ID: <002f01c6af50$97242250$15327e82@pyrimidine>
I have to do a little catching up on things here; lots of conversation this
morning!
According to NCBI, the SOURCE line can hold organelle data, an abbreviated
version of the scientific name, and the GenBank common name in parentheses.
No other information is present.
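To make the SOURCE layout concrete, here is a minimal Perl sketch that splits a SOURCE value into an optional organelle prefix, the name, and the common name in parentheses. The function name, the hash keys and the short organelle list are assumptions for illustration only; real records may use other organelle terms, and this is not what Bio::SeqIO::genbank currently does.

```perl
use strict;
use warnings;

# Assumed organelle prefixes for illustration; real records may use others.
my $ORGANELLE = qr/mitochondrion|chloroplast|plastid/;

# Hypothetical helper: split the text of a SOURCE line into its parts.
sub parse_source_line {
    my ($text) = @_;
    my %out;
    # A trailing "(...)" is taken as the GenBank common name.
    $out{common_name} = $1 if $text =~ s/\s*\(([^()]*)\)\.?\s*$//;
    # A leading organelle word, if present, is split off.
    $out{organelle} = $1 if $text =~ s/^($ORGANELLE)\s+//;
    # Whatever remains is treated as the (possibly abbreviated) name.
    $out{name} = $text;
    return \%out;
}

my $src = parse_source_line("Saccharomyces cerevisiae (baker's yeast)");
print "$src->{name} / $src->{common_name}\n";
```

Keeping the three pieces separate avoids dumping the whole line into common_name(), which is the behaviour complained about below.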
The ORGANISM lines contain the scientific name (NCBI definition) and the
lineage, generally only ranked nodes but not always. I believe it was Nadeem
Faruque who indicated that there is some way that NCBI marks the ranks which
determines whether or not they appear in the lineage.
Here's what Bio::SeqIO::genbank does to get data into and out of GenBank
files:
------------------------------------------------------
Bio::SeqIO::genbank in methods next_seq() and _read_GenBank_Species():
1) Bio::Species acts as a container object
2) The SOURCE data is dumped entirely into common_name() (ughhhh). There is
some additional work done as well before instantiating a Bio::Species ; if
it is considered an unknown organism there is no Bio::Species object
returned. We should get rid of that bit; every GenBank SOURCE has a TaxID
and therefore has a node, including plasmids and unknowns. There will be no
genus/species or anything else set for that group.
3) The ORGANISM name was divided up into genus(), species(), and
subspecies(), based on the classification array (again, ughhh).
4) The classification array is split into an array and dumped into
classification()
5) No parsing of potential organelle information occurs. None. Zero.
Squat.
6) TaxID is grabbed from the 'source' seqfeature and assigned via
ncbi_taxid(). We could use this to also grab the organelle, etc.
------------------------------------------------------
Bio::SeqIO::genbank in method write_seq():
1) SOURCE line : use the common_name data for output, but tag on the
subspecies information (?!?!?!).
2) ORGANISM lines : the name is rebuilt from the organelle() (which should
be on the SOURCE line) and the genus and species, which come from the
classification array (?!?!?!). The classification array is rebuilt from
classification().
------------------------------------------------------
Much of this may be cruft from changes in the official GenBank format that
we neglected to update.
However, I think there's WAY too much hand-wringing about trying to get
everything into genus() species() etc without anything more than the (very
scant) information in the flatfile, esp. when using the classification array
as a basis. The only places where reliable tax information is present in
the flatfile are:
1) SOURCE line (organelle, common name, abbreviated name)
2) ORGANISM lines (scientific name, classification array)
3) 'source' seqfeature (strain/variant (!), organelle, TaxID, etc found
here).
We should assign those accordingly; we could even use the 'source'
seqfeature to grab strain, organelle, etc. just like we now do for the
TaxID.
Beyond that we're really just guessing the ranks and the genus-species
names. Makes no sense, especially when that is easily available in
Bio::Taxonomy using entrez/flatfile. We could have Bio::Taxonomy::Species
act as a container for IO purpose, ONLY using the methods in the 'reliable
information' list above in Bio::SeqIO::genbank and other SeqIO RichSeqs.
Then hold the additional data with warnings attached if a lookup hasn't been
run, or not set them at all. Or, use Hilmar's suggestion and force the user
to use the db handle and ncbi_taxid() to grab a new
Bio::Taxonomy::Node/Species object (based on the rank) which has the correct
information.
As for the other container get/sets: species(), genus() etc.
These methods should be present, but only for species or below (hence
Bio::Taxonomy::Species). In a way the name Bio::Taxonomy::Species is not
entirely correct, as many times the sequence in a file is from an organism at
the genus level (unassigned species) or subspecies/strain levels, or is
unranked (environmental samples, for instance). All of these seem to have
TaxIDs though. Don't think it really matters...
We could convert Bio::Species into an abstract interface class
(Bio::SpeciesI), moving the implemented methods over to
Bio::Taxonomy::Species, and have Bio::Taxonomy::Species implement
Bio::Taxonomy::NodeI or Bio::TaxonomyI as well. Bio::Taxonomy::Species
could be checked with
$obj->isa('Bio::TaxonomyI') && $obj->isa('Bio::SpeciesI')
Or, modifying Hilmar's suggestion:
            |-----Tax::Node
NodeI/TaxI -|
            |-----Tax::Species
                |
SpeciesI -------|
So Species doesn't 'contaminate' Node.
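A minimal Perl sketch of that layout, assuming hypothetical Sketch::* package names so that nothing here is mistaken for the real BioPerl classes. The two "interface" packages only declare abstract methods; the isa() checks at the end are the same kind of test as the one-liner above.

```perl
use strict;
use warnings;

# Interface-style packages: abstract methods that die unless overridden.
package Sketch::NodeI;
sub rank { die "rank() not implemented\n" }

package Sketch::SpeciesI;
sub genus { die "genus() not implemented\n" }

# Plain node: implements NodeI only.
package Sketch::Node;
use parent -norequire, 'Sketch::NodeI';
sub new  { my ($class, %args) = @_; return bless { %args }, $class }
sub rank { $_[0]{rank} }

# Species-level node: implements both interfaces, without inheriting from Node.
package Sketch::Species;
use parent -norequire, 'Sketch::NodeI', 'Sketch::SpeciesI';
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub rank  { $_[0]{rank} }
sub genus { $_[0]{genus} }

package main;

my $node    = Sketch::Node->new(rank => 'family');
my $species = Sketch::Species->new(rank => 'species', genus => 'Saccharomyces');

for my $obj ($node, $species) {
    my $both = $obj->isa('Sketch::NodeI') && $obj->isa('Sketch::SpeciesI');
    printf "%s: %s\n", ref $obj, $both ? 'NodeI and SpeciesI' : 'NodeI only';
}
```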
This will allow you to proceed with doing what you want to
Bio::Taxonomy::Node; both Node and Species could be checked simultaneously
though they need to be changed at some point to implement the same base
class, so you could check using :
if ($obj->isa('Bio::Taxonomy::NodeI')) {
As for getting Bio::SeqIO::genbank to play well with Bio::Taxonomy::Species,
all I did was 'clone' the Bio::Taxonomy::Node module into
Bio::Taxonomy::Species, removed the warnings in species() and other methods
for the time being, and changed the method call for classification() in
Bio::SeqIO::genbank to send an array instead of an array_ref. Then I
modified the parsing to retain the scientific_name and abbreviated_name
(though the latter should go into common_names()). Passed all but one test,
where common_name was called and returned the entire SOURCE line (not
correct!). Pretty simple, really...
BTW, I checked EMBL format, and it is very similar to GenBank, with the
interesting addition of the OG line (for organelle).
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Monday, July 24, 2006 8:53 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Chris Fields wrote:
> > Bio::SeqIO::genbank works very happily with the current
> > Bio::Taxonomy::Node now; if we intend to remove most of the methods we
> > need to have a similar DB-aware module to house the flatfile data (like
> > Bio::Species) yet be capable of working with Bio::Taxonomy (like
> > Tax::Node).
>
> Can you give code examples of what Bio::SeqIO::genbank is doing and what
> makes it 'happy'? What are the requirements? Would it be as happy
> working with a Bio::Taxonomy object?
From cjfields at uiuc.edu Mon Jul 24 15:24:23 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:24:23 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C4CFF6.40609@sendu.me.uk>
Message-ID: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
> Hilmar Lapp wrote:
> > Sounds good to me, except there is no Bio::TaxonomyI yet,
>
> Indeed, I propose making one.
So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me
think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node
implements it.
...
> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that
> Bio::Species isa Bio::TaxonomyI:
>
> ...
> SOURCE Saccharomyces cerevisiae (baker's yeast)
> ORGANISM Saccharomyces cerevisiae
> Eukaryota; Fungi; Ascomycota; Saccharomycotina;
> Saccharomycetes;
> Saccharomycetales; Saccharomycetaceae; Saccharomyces.
>
> ...
>
> ## the fully-manual way
> my $species = new Bio::Species;
> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
> -rank => 'species', -object_id => 1,
> -parent_id => 2);
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> -object_id => 2, -parent_id => 3);
> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
> my $n3 = [etc]
> $species->add_node($node);
> $species->add_node($n2);
> [etc]
Hrmm... why would you add multiple nodes to a species object? A Species
is-a Node, not a full Bio::Taxonomy. Taxonomy has-a Node (hence the
add_node() method). So, you should be able to add a NodeI-implementing
object to a Taxonomy object (either a Node or a Species).
Not sure I agree with what you propose here; doesn't seem right...
...
> We also solve Chris' earlier quandary:
>
> [ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode
> exist, and given that Bio::DB::Taxonomy* currently directly make Node
> objects ]
> > The only problem I can foresee is which class to use with
> > Bio::DB::Taxonomy*? I guess one could settle on one class by default and
> > have the option to use another Bio::Taxonomy::NodeI-implementing class if
> > you wanted more data/methods available...
>
> The way to do it is to have the Bio::DB::Taxonomy* modules return only
> the information that a Bio::Taxonomy::FactoryI would need to make a
> NodeI. The specific Factory that you use could generate whatever type of
> Node you wanted.
Yes, using an object factory here makes a lot of sense, returning the
correct object type based on the rank.
...
> Bio::Species differs from Bio::Taxonomy only so it contains all the
> legacy methods names that Bio::Species currently has, for backward
> compatibility. Setting $species->classification() would delete all nodes
> of self, use a GenbankFactory to make a new Bio::Species, then pull out
> all its Nodes and add them to self.
The idea is to replace Bio::Species with something that works well, so
having it implement a Node-like interface works since it is-a Node. Having
it implement a Taxonomy-like interface, though, doesn't make a lot of sense
as a species is-not-a Taxonomy. It should act just like a fancier node
object.
Using a factory in Bio::DB::Taxonomy should solve any issues about what
object type is returned, since that could simply be made based on the rank
itself (species rank or below == Bio::Taxonomy::Species, genus and above ==
Bio::Taxonomy::Node).
> Unless anyone can think of a better way of doing things, I'll explore
> the above ideas and start writing code. To summarise: major changes to
> Bio::DB::Taxonomy* (make them factory slaves), implementation of some
> Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make
> Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI.
Nope. Don't agree. Sorry. I can't see why you would force a Species to be
a Taxonomy when it isn't. The object hierarchy doesn't make sense to me.
I would just have a simple interface for Node (NodeI), and either convert
Bio::Species to an abstract interface or place its methods in
Bio::Taxonomy::Species/SpeciesNode.
I like the interface idea as Bio::Taxonomy::Node is-a NodeI only, while
Bio::Taxonomy::Species is-a NodeI and SpeciesI; these checks can be run
using the UNIVERSAL object method 'isa' when using a Factory.
I'll repeat: a Node and a Species is-not-a Taxonomy. A Taxonomy object
has-a Node or Species or combinations thereof ; all would be
NodeI-implementing. That's the reason that add_node() is there, which could
be modified to allow only objects for which ->isa('Bio::Taxonomy::NodeI') is
true (i.e. a Node or a Species).
> Oh, Bio::Taxonomy might need some changes as well. It has a classify()
> method that does something with a Bio::Species, which would be all wrong in
> the new way of doing things.
We'll have to make eventual changes to anything referencing Bio::Species to
get them to work correctly. Getting the object hierarchy finalized and
worked out is priority one. Getting Bio::SeqIO modules switched over to
Bio::Taxonomy::Species (pretty commonly used) and making sure that
Bio::DB::Taxonomy returns the correct objects from the factory is a close
second. Any small issues that pop up along the way can be taken care of
when they reveal themselves.
Chris
From cjfields at uiuc.edu Mon Jul 24 15:34:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:34:55 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net>
Message-ID: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
> > Maybe the file parser could have its own organelle() method
> > and leave all taxonomic classes without such a method. Or it could
> > stay
> > as is, I don't know.
>
> Like I said above, at the end of the day there needs to be a way to
> qualify a sequence by the genome it is part of.
Agreed. I think Sendu's right in one regard, it doesn't seem to have
anything to do with the taxonomy itself. See below...
There should be a way of containing this somehow, maybe using a
Bio::Annotation::SimpleValue object or having a get/set somehow.
> > Do different organelles in the same species get unique taxonomy ids?
>
> I would have to confirm, but I believe so. As I said, from a genome/
> sequence-centric viewpoint, the organelle and nuclear genomes are two
> different things.
Looks like the organelle sequence data uses the organism TaxID. I couldn't
find organelle-specific taxon information using the TaxBrowser for
mitochondrion, chloroplast, or plastid.
source 1..426
/organism="Reticulitermes tibialis"
/organelle="mitochondrion"
/mol_type="genomic DNA"
/db_xref="taxon:186107"
/haplotype="T9"
TaxID refers to the organism ("Reticulitermes tibialis"), not the
mitochondrion.
source 1..814
/organism="Porterinema fluviatile"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/strain="SAG 124.79"
/db_xref="taxon:246123"
/country="Germany"
TaxID refers to the organism ("Porterinema fluviatile"), not the
chloroplast.
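For illustration, a minimal Perl sketch of pulling the TaxID and organelle out of 'source' feature qualifiers like the ones above. The helper parse_qualifiers is hypothetical and only handles single-line /key="value" qualifiers; it is not the Bio::SeqIO::genbank feature-table parser.

```perl
use strict;
use warnings;

# Collect /key="value" qualifiers from the lines of one feature.
# Hypothetical helper; multi-line qualifier values are not handled.
sub parse_qualifiers {
    my (@lines) = @_;
    my %q;
    for my $line (@lines) {
        next unless $line =~ m{^\s*/(\w+)="([^"]*)"\s*$};
        push @{ $q{$1} }, $2;
    }
    return \%q;
}

my $q = parse_qualifiers(
    '/organism="Reticulitermes tibialis"',
    '/organelle="mitochondrion"',
    '/mol_type="genomic DNA"',
    '/db_xref="taxon:186107"',
    '/haplotype="T9"',
);

# The TaxID sits in a db_xref of the form taxon:NNN.
my ($taxid) = map { /^taxon:(\d+)$/ ? $1 : () } @{ $q->{db_xref} || [] };
printf "taxid=%s organelle=%s\n", $taxid, $q->{organelle}[0] // 'none';
```

Since both values come from the same feature, the organelle can be read at the point where the TaxID is already being grabbed, without involving the taxonomy classes at all.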
Chris
From bix at sendu.me.uk Mon Jul 24 15:45:09 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 20:45:09 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
Message-ID: <44C52345.5060903@sendu.me.uk>
Chris Fields wrote:
>> Hilmar Lapp wrote:
>>> Sounds good to me, except there is no Bio::TaxonomyI yet,
>> Indeed, I propose making one.
>
> So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me
> think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node
> implements it.
No no, I guess the whole rest of your reply was confused by this one
point. Bio::TaxonomyI would be the interface for Bio::Taxonomy.
Definitely not a Node.
>> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that
>> Bio::Species isa Bio::TaxonomyI:
>>
>> ...
>> SOURCE Saccharomyces cerevisiae (baker's yeast)
>> ORGANISM Saccharomyces cerevisiae
>> Eukaryota; Fungi; Ascomycota; Saccharomycotina;
>> Saccharomycetes;
>> Saccharomycetales; Saccharomycetaceae; Saccharomyces.
>>
>> ...
>>
>> ## the fully-manual way
>> my $species = new Bio::Species;
>> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>> -rank => 'species', -object_id => 1,
>> -parent_id => 2);
>> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>> -object_id => 2, -parent_id => 3);
>> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
>> my $n3 = [etc]
>> $species->add_node($node);
>> $species->add_node($n2);
>> [etc]
>
>
> Hrmm... why would you add multiple nodes to a species object? A Species
> is-a Node, not a full Bio::Taxonomy.
In my proposal, a Bio::Species certainly is a full Bio::Taxonomy.
>> Bio::Species differs from Bio::Taxonomy only so it contains all the
>> legacy methods names that Bio::Species currently has, for backward
>> compatibility. Setting $species->classification() would delete all nodes
>> of self, use a GenbankFactory to make a new Bio::Species, then pull out
>> all its Nodes and add them to self.
>
> The idea is to replace Bio::Species with something that works well, so
> having it implement a Node-like interface works since it is-a Node. Having
> it implement a Taxonomy-like interface, though, doesn't make a lot of sense
> as a species is-not-a Taxonomy.
Right. So this is why we've been 'butting heads'. Up till now I had no
idea why you were so adamant about keeping things the old
Bio::Taxonomy::Node way.
Bio::Species very definitely has never been, nor do we want it to
become, a single node of a taxonomy. It has always been a complete
taxonomy. You can tell that by the fact it has a classification, and you
could ask what its genus is.
This is why I'm proposing that Bio::Species become a Bio::Taxonomy.
Because that's the correct object model for the kinds of things
Bio::Species wants to do.
> Using a factory in Bio::DB::Taxonomy should solve any issues about what
> object type is returned, since that could simply be made based on the rank
> itself (species rank or below == Bio::Taxonomy::Species, genus and above ==
> Bio::Taxonomy::Node).
Frankly, that idea makes me ill. A Node, at the fundamental level, is
just a very simple object that needs to associate a taxonomic rank with
a scientific name. If you start making different objects for different
ranks, you've departed from any semblance of meaning in the object model.
> Nope. Don't agree. Sorry. I can't see why you would force a Species to be
> a Taxonomy when it isn't. The object hierarchy doesn't make sense to me.
Does it make sense now?
> I'll repeat: a Node and a Species is-not-a Taxonomy.
I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)
> A Taxonomy object has-a Node or Species or combinations thereof ;
No, a Taxonomy contains Nodes. One of those Nodes might have a rank() of
'species'.
A Bio::Species contains Nodes. One of those Nodes definitely has a
rank() of 'species'. It /must/ have other nodes, because the job of
Bio::Species has been, and will continue to be, to store all the
other taxonomic levels in a Genbank file. For the same reason
Bio::Species can't be a Node itself, because you can't store other Nodes
inside a Node.
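A minimal Perl sketch of this container model, using hypothetical Sketch::* packages rather than the real Bio::Species or Bio::Taxonomy API. The container holds simple nodes keyed by id, answers rank questions by looking through them, and rebuilds a classification by walking parent ids. The ranks given to the genus and family nodes are set by hand here purely for the example.

```perl
use strict;
use warnings;

# A node only pairs a name with an optional rank, plus id/parent_id links.
package Sketch::Node;
sub new { my ($class, %args) = @_; return bless { %args }, $class }

# Stands in for a Taxonomy-like Bio::Species: a container of nodes.
package Sketch::Lineage;
sub new { my ($class) = @_; return bless { nodes => {} }, $class }

sub add_node {
    my ($self, $node) = @_;
    $self->{nodes}{ $node->{id} } = $node;
    return $self;
}

# Name of the node holding a given rank, or undef if there is none.
sub name_for_rank {
    my ($self, $rank) = @_;
    for my $n (values %{ $self->{nodes} }) {
        return $n->{name} if defined $n->{rank} && $n->{rank} eq $rank;
    }
    return;
}

# Walk parent ids from the species node up to the last known ancestor.
sub classification {
    my ($self) = @_;
    my ($n) = grep { defined $_->{rank} && $_->{rank} eq 'species' }
              values %{ $self->{nodes} };
    my @names;
    while ($n) {
        push @names, $n->{name};
        $n = defined $n->{parent_id} ? $self->{nodes}{ $n->{parent_id} } : undef;
    }
    return @names;
}

package main;

my $sp = Sketch::Lineage->new;
$sp->add_node(Sketch::Node->new(id => 1, parent_id => 2, rank => 'species',
                                name => 'Saccharomyces cerevisiae'));
$sp->add_node(Sketch::Node->new(id => 2, parent_id => 3, rank => 'genus',
                                name => 'Saccharomyces'));
$sp->add_node(Sketch::Node->new(id => 3, rank => 'family',
                                name => 'Saccharomycetaceae'));

print join('; ', $sp->classification), "\n";
print $sp->name_for_rank('genus'), "\n";
```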
From cjfields at uiuc.edu Mon Jul 24 15:49:06 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:49:06 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net>
Message-ID: <003e01c6af5a$390cdea0$15327e82@pyrimidine>
Yes, 'largely' the key word. I don't really agree with Sendu's hierarchy
scheme (making Species implement Taxonomy and not Node doesn't make sense),
but, besides that, everything else seems fine. I like the following setup
(which is similar to what you proposed, I believe), which I already posted.
            |-----Tax::Node
NodeI-------|
            |-----Tax::SpeciesNode
                |
SpeciesI -------|
Taxonomy::Node is-a NodeI
Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI
Bio::Taxonomy 'has-a' NodeI-implementing module
SeqIO has-a SpeciesI-implementing module
Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
specifically, a SpeciesNode for species ranks or below, and a Node for
anything else.
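A minimal Perl sketch of the rank-based factory idea, under stated assumptions: the Sketch::* package names, the create_node() method and the list of ranks counted as "species or below" are all hypothetical, and for brevity SpeciesNode simply inherits from Node here instead of going through separate interface classes. Nothing in it is existing Bio::DB::Taxonomy behaviour.

```perl
use strict;
use warnings;

package Sketch::Node;
sub new  { my ($class, %args) = @_; return bless { %args }, $class }
sub rank { $_[0]{rank} }

# Species-level node; shares the Node constructor in this sketch.
package Sketch::SpeciesNode;
use parent -norequire, 'Sketch::Node';

package Sketch::RankFactory;

# Assumed set of ranks treated as "species or below".
my %SPECIES_OR_BELOW = map { $_ => 1 } qw(species subspecies varietas forma);

# Pick the node class from the rank, then construct it with the same args.
sub create_node {
    my ($class, %args) = @_;
    my $rank = defined $args{rank} ? $args{rank} : '';
    my $node_class = $SPECIES_OR_BELOW{$rank} ? 'Sketch::SpeciesNode'
                                               : 'Sketch::Node';
    return $node_class->new(%args);
}

package main;

for my $rec ([ 'Saccharomyces cerevisiae', 'species' ],
             [ 'Saccharomyces',            'genus'   ]) {
    my ($name, $rank) = @$rec;
    my $node = Sketch::RankFactory->create_node(name => $name, rank => $rank);
    printf "%-25s %-8s %s\n", $name, $rank, ref $node;
}
```

Callers that only need a node can keep checking isa('Sketch::Node'); only code that cares about species-level methods would need to test for the more specific class.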
It would be nice to get this hammered out soon. I think we can actually
start work on the Bio::Taxonomy::Node/SpeciesNode split; the interface
classes would be easy to add. I could work on getting SeqIO to work with
Bio::Taxonomy::SpeciesNode when I can (sometime in the next few weeks).
Like I mentioned before, I already have Bio::SeqIO::genbank using it but
won't commit it to CVS until we have sorted out the class hierarchy and
interface-implementation issues.
I won't be able to add too much more to this for a few weeks, unfortunately.
I need to prepare for a conference as well as finish up a ton of bench
research. I'll try keeping up though...
Chris
> :-) I think we're largely in agreement. As for node_name() I fully
> understand the motivation, but it needs to be understood that the
> attribute's value will be based on a largely arbitrary choice unless
> it is set directly by the user.
>
> -hilmar
>
> On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote:
>
> > Hilmar Lapp wrote:
> >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
> >>
> >>> Bio::DB::Taxonomy::flatfile
> >>> ---------------------------
> >>> [...]
> >>>
> >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it
> >>> makes the
> >>> division as a three letter code, like 'PRI'. However, for
> >>> consistency
> >>> with entrez and the scientific_name() of the node the division is
> >>> supposed to correspond to, it is now stored as the full name, like
> >>> 'Primates'.
> >>
> >> What about adding a method division_code() which would return the 3-
> >> letter abbreviation?
> >>
> >> The abbreviation may be needed by flat-file writers, so it may be
> >> handy to have in some cases.
> >
> > As far as I know you can't get the 3-letter version via entrez, so no
> > other module can really expect to be able to get it, not knowing which
> > database (flatfile.pm or entrez.pm) the taxonomic information is
> > coming from.
> >
> > But of course it would be somewhat harmless to add division_code()
> > anyway. It might be better done as a -code => 1 option to division()?
> >
> >
> >>> The names->id solution also stores the artificially uniqued names
> >>> like
> >>> 'Craniata ', allowing you for the first time to
> >>> retrieve the
> >>> correct id. Previously the search would have simply failed
> >>> completely.
> >>>
> >>> The names->id solution now handles nodes with scientific names of
> >>> 'xyz
> >>> (class)', allowing you to retrieve the id with both get_taxonids
> >>> ('xyz')
> >>> and get_taxonids('xyz (class)'). Previously only the latter would
> >>> work.
> >>
> >> Should angle brackets be allowed too?
> >
> > Allowed in what sense? You can indeed search for both
> > get_taxonids('Craniata ') [returns a single id] and
> > get_taxonids('Craniata') [returns multiple ids, one of which is the
> > previous answer].
> >
> >
> >> Maybe there should also be a -names parameter which accepts a hash
> >> reference with keys being the kind of name (scientific, common, etc)
> >> and the values being array references with the set of names of that
> >> kind?
> >
> > Not sure what you mean. name() has that data structure, though you're
> > not supposed to set its hash ref directly.
> >
> >
> >>> or the $node->classification() array.
> >>
> >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
> >> brought over from a flawed (because flat) object model in
> >> Bio::Species.
> >
> > Yes, I agree.
> >
> >
> >>> NOTE: entrez modules (and website) cannot cope with ''
> >>> in the
> >>> query, failing searches like 'Craniata '. For this
> >>> reason, if
> >>> get_taxonids() is given a query with '' it will
> >>> immediately
> >>> return undefined, saving a pointless website access.
> >>
> >> If there is a 'next-best-thing' that is still semantically compatible
> >> with the API documentation, I would do that.
> >>
> >> In this case, if there is a in the query the entrez
> >> module should strip it and automatically use the rest for searching.
> >> If indeed multiple IDs match there should be a warning to inform the
> >> user that entrez cannot use the notation to limit the
> >> query results.
> >
> > I wouldn't like this. I actually had it working this way initially,
> > but
> > decided that if someone entered 'xyz ' they really didn't
> > want multiple ids, expected to get multiple ids with just 'xyz' and
> > don't want their query made something else and then be warned about
> > it.
> >
> >
> >> In fact, you might as well provide an option to enable an automatic
> >> check for the correct branch for each ID if multiple ones are
> >> returned. I.e., if this option is enabled, the module would
> >> automatically query the parent nodes to see if is in the
> >> lineage, and if not will remove the respective ID from the result
> >> set. The reason you may want to make it optional is because it
> >> potentially costs time. (but in reality I'm not sure why a client
> >> will not want to enable the option - so maybe this should even be
> >> default)
> >
> > I can certainly add that, it seems like a good idea. I don't, however,
> > see any scope for an option at all. What would the option be called?
> > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless,
> > imho. If the user queries 'xyz ' with that option, they're
> > just going to have to do for themselves manually what the method would
> > have done for them without that option, in order to get the correct
> > answer. It'll be slower that way, if anything. So the option would
> > actually be called
> > -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower
> > (!).
> >
> >
> >>> Bio::Taxonomy::Node
> >>> -------------------
> >>> [...]
> >>> classification() has a proper solution to finding the classification
> >>> when the array wasn't manually set.
> >>>
> >>> # Improvements
> >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name
> >>> ('common'). Now
> >>> it is an alias to name('scientific').
> >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so
> >>> flatfile and entrez and user-created nodes now implicitly associate
> >>> the
> >>> name of the node they create with its scientific name.
> >>
> >> I'm not even sure node_name() should just be deprecated. The method
> >> falsely suggests that there is only a single and definitive name for
> >> the taxon node.
> >>
> >> In NCBI reality, this is only true for the scientific name of the
> >> node. In real reality, many nodes have multiple scientific names -
> >> taxonomy isn't static and therefore the scientific naming of nodes
> >> isn't either.
> >
> > For the programmer not using any database but just making up his own
> > nodes, I think he needs a node_name() because he may not be thinking
> > about anything fancy or realistic. He just wants to give his node a
> > single name that he invents. node_name() seems like the ideal method
> > name to me.
> >
> >
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
From hlapp at gmx.net Mon Jul 24 15:56:02 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 15:56:02 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
Message-ID: <88700A84-B426-4BC7-88F2-D5E793870ADF@gmx.net>
On Jul 24, 2006, at 3:24 PM, Chris Fields wrote:
>
>> Hilmar Lapp wrote:
>>> Sounds good to me, except there is no Bio::TaxonomyI yet,
>>
>> Indeed, I propose making one.
>
> So, Node would implement this, correct?
No -
> Naming it Bio::TaxonomyI makes me
> think that Bio::Taxonomy implements TaxonomyI, not that
> Bio::Taxonomy::Node
> implements it.
I'd suppose so.
>> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that
>> Bio::Species isa Bio::TaxonomyI:
>>
>> ...
>> SOURCE Saccharomyces cerevisiae (baker's yeast)
>> ORGANISM Saccharomyces cerevisiae
>> Eukaryota; Fungi; Ascomycota; Saccharomycotina;
>> Saccharomycetes;
>> Saccharomycetales; Saccharomycetaceae; Saccharomyces.
>>
>> ...
>>
>> ## the fully-manual way
>> my $species = new Bio::Species;
>> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>> -rank => 'species', -object_id => 1,
>> -parent_id => 2);
>> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>> -object_id => 2, -parent_id => 3);
>> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
>> my $n3 = [etc]
>> $species->add_node($node);
>> $species->add_node($n2);
>> [etc]
>
>
> Hrmm... why would you add multiple nodes to a species object? A
> Species
> is-a Node, not a full Bio::Taxonomy.
No. See above: Bio::Species is-a Bio::Taxonomy.
> Taxonomy has-a Node (hence the
> add_node() method). So, you should be able to add a NodeI-
> implementing
> object to a Taxonomy object (either a Node or a Species).
Let's keep Bio::Species and Taxonomy::Node separate. They look like they
represent something similar, but once you look at the Bio::Species
API (and a Genbank record) you realize they do not. Bio::Species is
more like an entire lineage and the species node all flattened out
into one.
I'm not sure Bio::Species would need to implement a Bio::TaxonomyI
interface; it may as well just use an implementation of it
internally. I'm not sure how Sendu wants to design this, but for sure
Bio::Taxonomy::Node should not be a Bio::Species, and the reverse
should rather be avoided too.
>> [..]
>> The way to do it is to have the Bio::DB::Taxonomy* modules return
>> only
>> the information that a Bio::Taxonomy::FactoryI would need to make a
>> NodeI. The specific Factory that you use could generate whatever
>> type of
>> Node you wanted.
>
> Yes, using an object factory here makes a lot of sense, returning the
> correct object type based on the rank.
Well, I don't think you'd want to create instances of different node
classes depending on the rank of the node. However, a particular
factory implementation may of course be free to do exactly that.
> ...
>> Bio::Species differs from Bio::Taxonomy only so it contains all the
>> legacy methods names that Bio::Species currently has, for backward
>> compatibility. Setting $species->classification() would delete all
>> nodes
>> of self, use a GenbankFactory to make a new Bio::Species, then
>> pull out
>> all its Nodes and add them to self.
>
> The idea is to replace Bio::Species with something that works well, so
> having it implement a Node-like interface works since it is-a
> Node. Having
> it implement a Taxonomy-like interface, though, doesn't make a lot
> of sense
> as a species is-not-a Taxonomy. It should act just like a fancier
> node
> object.
No, I'd really recommend against muddling up a taxonomy node model
with the Bio::Species legacy model.
Bio::Species is not a node at all. You may argue it's not a taxonomy
either. This is just one more reason for containing the Bio::Species
contagious disease of conflating disjoint concepts into one.
>
> Using a factory in Bio::DB::Taxonomy should solve any issues about
> what
> object type is returned, since that could simply be made based on
> the rank
> itself (species rank or below == Bio::Taxonomy::Species, genus and
> above ==
> Bio::Taxonomy::Node).
Bio::Taxonomy::Species was an invention of mine and - if created -
should not be used for anything other than representing a taxonomy
node as a Bio::Species object iff necessary (i.e., if the client
really wants a Bio::Species object).
I'd actually like to see what Sendu would come up with. It sounds at
the very minimum like an excellent start.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Mon Jul 24 15:59:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 15:59:10 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
References: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
Message-ID: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net>
On Jul 24, 2006, at 3:34 PM, Chris Fields wrote:
> Looks like the organelle sequence data uses the organism TaxID.
Then you might as well store it as annotation. Really the only thing
that matters is that the flat file writers can get it from an expected
location.
In fact storing it as annotation is better, e.g. for BioSQL, since right
now the taxonomy model is the NCBI model and so organelle will not be
stored (and hence will not be round-tripped either).
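A small sketch of "store it as annotation, read it back from an expected location". Sketch::AnnotationStore, the 'organelle' key and source_line() are hypothetical; in BioPerl the equivalent would presumably go through the annotation collection of the sequence, which is not shown here.

```perl
use strict;
use warnings;

# Hypothetical key -> values store standing in for an annotation collection.
package Sketch::AnnotationStore;
sub new { my ($class) = @_; return bless { by_key => {} }, $class }
sub add {
    my ($self, $key, $value) = @_;
    push @{ $self->{by_key}{$key} }, $value;
    return $self;
}
sub get {
    my ($self, $key) = @_;
    return @{ $self->{by_key}{$key} || [] };
}

package main;

# The writer only needs to know the agreed key ('organelle' is an assumption).
sub source_line {
    my ($scientific_name, $annotations) = @_;
    my ($organelle) = $annotations->get('organelle');
    my @parts = grep { defined $_ } ($organelle, $scientific_name);
    return 'SOURCE      ' . join(' ', @parts);
}

my $annotations = Sketch::AnnotationStore->new;
$annotations->add(organelle => 'mitochondrion');

my $with    = source_line('Homo sapiens', $annotations);
my $without = source_line('Homo sapiens', Sketch::AnnotationStore->new);
print "$with\n";       # SOURCE      mitochondrion Homo sapiens
print "$without\n";    # SOURCE      Homo sapiens
```

Because the organelle never touches the taxonomy model, a store that only knows NCBI taxonomy loses nothing.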
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Mon Jul 24 16:10:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 15:10:20 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net>
Message-ID: <000001c6af5d$3094b830$15327e82@pyrimidine>
Sounds good. Will be easy to change this over.
Chris
From hlapp at gmx.net Mon Jul 24 16:12:39 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 16:12:39 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003e01c6af5a$390cdea0$15327e82@pyrimidine>
References: <003e01c6af5a$390cdea0$15327e82@pyrimidine>
Message-ID: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net>
On Jul 24, 2006, at 3:49 PM, Chris Fields wrote:
> Yes, 'largely' the key word. I don't really agree with Sendu's
> hierarchy
> scheme (making Species implement Taxonomy and not Node doesn't make
> sense),
> but, besides that, everything else seems fine. I like the
> following setup
> (which is similar to what you proposed, I believe), which I already
> posted.
>
>             |-----Tax::Node
> NodeI-------|
>             |-----Tax::SpeciesNode
>                       |
> SpeciesI -------------|
>
> Taxonomy::Node is-a NodeI
> Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI
I don't even think we would need SpeciesI - why would a
species-ranked taxonomy node be so different from any other node that
it would need its own interface?
Chris - just one suggestion: take a step back and imagine a Bioperl
in which Bio::Species had never existed. Instead, only taxonomy nodes
existed, and code that can effectively deal with them, including
filtering by rank. In this picture, what would make you want to
introduce SpeciesI and Bio::Species?
Frankly, I don't see anything. I.e., the only reason is backward
compatibility (which is a valid reason), but let's not glorify
Bio::Species by adding ill-conceived interfaces.
>
> Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> specifically, a SpeciesNode for species ranks or below, and a Node for
> anything else.
Like I said before, SpeciesNode or whatever it's called would draw
its right of existence solely from backward compatibility - don't use
it for anything else. And if you can achieve backward compatibility
by other means, don't even create a SpeciesNode.
My $0.02 ...
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Mon Jul 24 17:34:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:34:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net>
Message-ID: <000101c6af68$f27521a0$15327e82@pyrimidine>
> I don't even think we would need SpeciesI - why would a species-
> ranked taxonomy node be so different from any other node such that it
> would need its own interface.
>
> Chris - just one suggestion: take a step back and imagine a Bioperl
> in which Bio::Species had never existed. Instead, only taxonomy nodes
> existed, and code that can effectively deal with them, including
> filtering by rank. In this picture, what would make you want to
> introduce SpeciesI and Bio::Species?
Argh!!! Just when I thought I could pull away...
Okay. I thought it would be nice to have a class that could accomplish two
things:
1) Act as a container for GenBank taxonomy information;
Bio::Taxonomy::Node, as written by Jason, was meant to be a replacement for
Bio::Species.
2) Also act as a bridge, so you had the option to retrieve the Species
object from a sequence object and have it act like a Node (be db-aware
out-of-the-box, so to speak).
Also, I'm trying to follow the original idea as proposed by Jason (this is
from perldoc Bio::Taxonomy::Node):
DESCRIPTION
This is the next generation (for Bioperl) of representing Taxonomy
information. Previously all information was managed by a single object
called Bio::Species. This new implementation allows representation of
the intermediate nodes not just the species nodes and can relate their
connections.
Which, to me, indicated that this would eventually replace Bio::Species (so,
in effect, must at least contain the relevant data for sequence objects w/o
being completely reliant on DB, yet still be DB-aware). Everything about
Bio::Species on the wiki also leads me to believe that this was the original
intent for Bio::Taxonomy::Node.
http://www.bioperl.org/wiki/Module:Bio::Species
http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
And all the original methods (genus(), species(), etc.) also seem to
indicate this.
That's really it. I could give a toss about getting taxonomy information
directly from Bio::Species. And you're right: in hindsight Bio::Species is
flawed. However, it seemed from the beginning of this discussion with Sendu
and the proposed changes, that Bio::Species should stick around in some
capacity but should also be involved with Bio::Taxonomy (contrary to Jason's
idea above). Now I'm hearing something completely different (Sendu still
argues that it should be involved).
I had originally wanted to start delegating everything over to
Taxonomy::Node about a month ago, when I found that it was remarkably easy
to do so. However, when Sendu proposed making changes to remove methods in
Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would
prevent an easy transition over to Node, I felt that it would be harder to
effectively have it take over for Bio::Species when parsing SeqIO objects
(all the calls to genus/species/subspecies etc methods would have to be
removed from all the classes which use Bio::Species). Hence
Bio::Taxonomy::Species as a compromise. Now it turns out no one wants to
have either Bio::Species (your 'contagion' references clue me in there) or
Bio::Taxonomy::Species.
If we think it would be better to completely toss all this out the window
and use only a bare-bones Node, then I'm fine with that. But if we go that
route we should just get rid of the Bio::Species 'disease' completely and
have things be much simpler. Simple is good!
I think Node can still act as a viable container class for the tax data from
a GenBank file (its original purpose) as long as it has the very basic
methods for doing so. That would require:
scientific_name() - ORGANISM line data
common_names() - which could hold common names (in parentheses on the SOURCE
line) and the abbreviated name (from the SOURCE line)
ncbi_taxid() - from the 'source' seqfeature (already there).
The lineage information and organelle information could be stored in Node or
in SimpleValue objects. My vote is for the latter as there's no need for a
classification() container for Node, which you have repeatedly pointed out.
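A bare-bones container along those lines might look like the sketch below. Sketch::TaxonNode and node_from_genbank() are hypothetical names, the regular expressions are simplified assumptions about the ORGANISM, SOURCE and db_xref text, and only the parenthesised common name is handled (not the abbreviated name).

```perl
use strict;
use warnings;

# Hypothetical minimal node: just the three accessors listed above.
package Sketch::TaxonNode;
sub new {
    my ($class, %args) = @_;
    return bless {
        scientific_name => $args{scientific_name},
        common_names    => [ @{ $args{common_names} || [] } ],
        ncbi_taxid      => $args{ncbi_taxid},
    }, $class;
}
sub scientific_name { return $_[0]{scientific_name} }
sub common_names    { return @{ $_[0]{common_names} } }
sub ncbi_taxid      { return $_[0]{ncbi_taxid} }

package main;

# Build a node from three GenBank fragments, with no attempt to split
# the scientific name into genus and species.
sub node_from_genbank {
    my (%field) = @_;
    my ($scientific) = $field{organism} =~ /^\s*ORGANISM\s+(.+?)\s*$/;
    my ($common)     = $field{source}   =~ /\(([^()]+)\)\s*$/;
    my ($taxid)      = $field{db_xref}  =~ /taxon:(\d+)/;
    return Sketch::TaxonNode->new(
        scientific_name => $scientific,
        common_names    => [ defined $common ? $common : () ],
        ncbi_taxid      => $taxid,
    );
}

my $node = node_from_genbank(
    source   => 'SOURCE      Homo sapiens (human)',
    organism => '  ORGANISM  Homo sapiens',
    db_xref  => '/db_xref="taxon:9606"',
);
printf "%s [%s] taxid %s\n",
    $node->scientific_name, join(', ', $node->common_names), $node->ncbi_taxid;
# prints: Homo sapiens [human] taxid 9606
```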
> Frankly, I don't see anything. I.e., the only reason is backward
> compatibility (which is a valid reason), but let's not glorify
> Bio::Species by adding ill-conceived interfaces.
I think we should just get rid of Bio::Species completely. We would need to
go in and rework species parsing in the SeqIO modules that use Bio::Species,
but that would only make things simpler, not more complex. Get rid of
trying to figure out what is a genus or species based on the GenBank
information only, and have the bridge between the sequences be stored in a
Taxonomy::Node object (which should contain the NCBI TaxID, so then it can
use the associated DB object to traverse up and down other nodes). The
interface idea was a proposed compromise i.e. my 'bridge' between GenBank
taxonomy hell and Bio::Taxonomy bliss, and intended to follow what I thought
was Jason's original intent for Bio::Taxonomy::Node. Nothing more.
> > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> > specifically, a SpeciesNode for species ranks or below, and a Node for
> > anything else.
>
> Like I said before, SpeciesNode or whatever it's called would draw
> its right of existence solely from backward compatibility - don't use
> it for anything else. And if you can achieve backward compatibility
> by other means, don't even create a SpeciesNode.
Agreed. But, if there is such venom towards Bio::Species, why not put it
out of its misery as well? Seems like it has outlived its usefulness.
Chris
From cjfields at uiuc.edu Mon Jul 24 17:53:46 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:53:46 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C52345.5060903@sendu.me.uk>
Message-ID: <000201c6af6b$a4534580$15327e82@pyrimidine>
> > I'll repeat: a Node and a Species is-not-a Taxonomy.
>
> I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)
Nope. I think this is incorrect. Here's why.
Let's look at the reasons Bio::Taxonomy was started, shall we?
From perldoc Bio::Taxonomy:
DESCRIPTION
Bio::Taxonomy object represents any rank-level in taxonomy system,
rather than Bio::Species which is able to represent only species-level.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
From perldoc Bio::Taxonomy::Node:
DESCRIPTION
This is the next generation (for Bioperl) of representing Taxonomy
information. Previously all information was managed by a single object
called Bio::Species. This new implementation allows representation of
the intermediate nodes not just the species nodes and can relate their
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
connections.
Bioperl wiki:
http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
http://www.bioperl.org/wiki/Module:Bio::Species
Both talk about delegating or replacing Bio::Species with
Bio::Taxonomy::Node.
Every one of those indicates what the original idea for Bio::Taxonomy::Node
was (eventual replacement for Bio::Species). Even the original methods for
Bio::Taxonomy::Node are the same. So, according to this alone, Bio::Species
would eventually be replaced by Bio::Taxonomy::Node.
I wanted an easier transition to Node from Bio::Species (hell, just a few
changes and using Bio::Taxonomy::Node worked fine!) , but your proposals
made sense. I saw having a Species-based Tax object as a nice compromise,
but Hilmar has made a few good points: would we have a Bio::Species object
around knowing what we know now? When Bio::Species was originally designed,
it was probably before the NCBI Tax database existed. I think it has
outlasted its current use.
I have posted a response to Hilmar. I think we should just get rid of
Bio::Species altogether and have a Taxonomy::Node contain the basic data
(scientific_name(), common_names(), etc). And remove any SeqIO parsing of
genus/species to simplify everything. All this extra parsing and
hand-wringing over trying to get species/genus information from a GenBank
file just mucks up ORGANISM and SOURCE line parsing anyway. Simplify it.
Simple is good.
Radical? Yes, but I agree with him that Bio::Species has outlasted its
use. As for organelle and lineage information, they could be placed in
SimpleValue objects. If anyone wants to grab tax information, they can use
the Node object to get it but they'll need a local flatfile database or
network connection to do so. This also means there is no need for a
Bio::DB::Taxonomy factory: just return Node objects directly. Each format
(flatfile and entrez) currently works this way anyway, correct? Simplifies
that. Simple is better.
Of course, we couldn't get rid of Bio::Species until all the following were
shifted over to Node somehow: ; >
Instances: 2 BP Module : Bio::Cluster::SequenceFamily
Instances: 4 BP Module : Bio::Cluster::UniGene
Instances: 1 BP Module : Bio::Cluster::UniGeneI
Instances: 1 BP Module : Bio::DB::FileCache
Instances: 3 BP Module : Bio::DB::GFF::Segment
Instances: 1 BP Module : Bio::DB::Taxonomy::flatfile
Instances: 2 BP Module : Bio::Graph::IO::psi_xml
Instances: 1 BP Module : Bio::Map::CytoMap
Instances: 1 BP Module : Bio::Map::LinkageMap
Instances: 3 BP Module : Bio::Map::MapI
Instances: 3 BP Module : Bio::Map::SimpleMap
Instances: 3 BP Module : Bio::Matrix::PSM::InstanceSite
Instances: 6 BP Module : Bio::Phenotype::Correlate
Instances: 1 BP Module : Bio::Phenotype::OMIM::OMIMentry
Instances: 3 BP Module : Bio::Phenotype::OMIM::OMIMparser
Instances: 5 BP Module : Bio::Phenotype::Phenotype
Instances: 2 BP Module : Bio::Phenotype::PhenotypeI
Instances: 4 BP Module : Bio::Seq
Instances: 3 BP Module : Bio::SeqI
Instances: 2 BP Module : Bio::SeqIO::agave
Instances: 4 BP Module : Bio::SeqIO::bsml
Instances: 2 BP Module : Bio::SeqIO::bsml_sax
Instances: 1 BP Module : Bio::SeqIO::chadoxml
Instances: 1 BP Module : Bio::SeqIO::chaos
Instances: 4 BP Module : Bio::SeqIO::embl
Instances: 2 BP Module : Bio::SeqIO::entrezgene
Instances: 3 BP Module : Bio::SeqIO::game::seqHandler
Instances: 4 BP Module : Bio::SeqIO::genbank
Instances: 2 BP Module : Bio::SeqIO::kegg
Instances: 2 BP Module : Bio::SeqIO::locuslink
Instances: 4 BP Module : Bio::SeqIO::swiss
Instances: 2 BP Module : Bio::SeqIO::table
Instances: 2 BP Module : Bio::SeqIO::tigr
Instances: 2 BP Module : Bio::SeqIO::tigrxml
Instances: 7 BP Module : Bio::SeqIO::tinyseq
Instances: 4 BP Module : Bio::Taxonomy
Instances: 1 BP Module : Bio::Taxonomy::Node
Instances: 6 BP Module : Bio::Taxonomy::Taxon
Instances: 9 BP Module : Bio::Taxonomy::Tree
Instances: 5 BP Module : Bio::Tools::Analysis::Protein::ELM
Chris
From bix at sendu.me.uk Mon Jul 24 18:15:31 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 23:15:31 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000101c6af68$f27521a0$15327e82@pyrimidine>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
Message-ID: <44C54683.70707@sendu.me.uk>
Chris Fields wrote:
>
> Also, I'm trying to follow the original idea as proposed by Jason (this is
> from perldoc Bio::Taxonomy::Node):
>
> Which, to me, indicated that this would eventually replace Bio::Species
Well, we don't really know that Jason didn't later change his mind, but
in any case it doesn't make sense (anymore, given that we have
Bio::Taxonomy).
In a direct reply to me you point out specific passages in the current
docs that explain why you have thought we should delegate or replace
Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are
not something we are forced to blindly follow. We decide for ourselves
if they make sense, we decide for ourselves if there is a better way of
doing it, and then we do it the best way.
So if you ignore what those old bits of documentation say, just pretend
you never ever read them, would my proposals make sense or not? Since
those old proposals were never implemented we have no reason to try and
stick with them if there is a better proposal.
And for the record, '...Bio::Species which is able to represent only
species-level' can (correctly) be interpreted as 'Bio::Species is only
supposed to be used for representing a taxonomy that includes the
species-level'. You can't interpret it literally because Bio::Species is
used for levels below species, and also represents all the levels above
species-level as well. Either Jason got it wrong when he wrote that, or
you have misinterpreted it.
Likewise, let's play the interpretation game again: 'Previously all
information was managed by a single object called Bio::Species. [the
Bio::Taxonomy::Node] implementation allows representation of the
intermediate nodes not just the species nodes'. Note the apposition of
'single object' vs implication of multiple Node objects to do the same
job. I imagine at the time Jason wrote that there was no Bio::Taxonomy,
no holder for multiple Nodes.
> I had originally wanted to start delegating everything over to
> Taxonomy::Node about a month ago, when I found that it was remarkably easy
> to do so. However, when Sendu proposed making changes to remove methods in
> Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would
> prevent an easy transition over to Node,
But an equally easy transition to Bio::Taxonomy instead. I don't know
why you would care about the name of the class we switch to. My concern
is that when the switch is made it makes sense.
> If we think it would be better to completely toss all this out the window
> and use only a bare-bones Node, then I'm fine with that. But if we go that
> route we should just get rid of the Bio::Species 'disease' completely and
> have things be much simpler. Simple is good!
>
> I think Node can still act as a viable container class for the tax data from
> a GenBank file (its original purpose) as long as it has the very basic
> methods for doing so. That would require:
>
> scientific_name() - ORGANISM line data
> common_names() - which could hold common names (in parentheses on the SOURCE
> line) and the abbreviated name (from the SOURCE line)
> ncbi_taxid() - from the 'source' seqfeature (already there).
>
> The lineage information and organelle information could be stored in Node or
> in SimpleValue objects. My vote is for the latter as there's no need for a
> classification() container for Node, which you have repeatedly pointed out.
No, this is the whole point. The lineage information can NOT be stored
in a Node (unless you abuse Node by having all those crufty methods
like genus() and classification()), and why would we store it in
SimpleValue objects when we have Bio::Taxonomy?
Bio::Taxonomy is completely perfect for storing the taxonomic
information from a GenBank file. That's all you need to worry about. Can
we represent the data correctly? Yes. Do we gain all the good things
about a pure Bio::Taxonomy? Yes. Can we still do everything we used to
be able to do? Yes.
> I think we should just get rid of Bio::Species completely.
There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy
with backward-compatible methods. No harm done, all good.
I'll tell you what. This will be easier if I just write the code for my
proposals, including whatever changes would be needed in
Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is,
and hopefully everyone will be happy.
Perhaps you could just hold off doing any similar-but-contradictory work
until then.
From hlapp at gmx.net Mon Jul 24 19:47:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 19:47:10 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C54683.70707@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
<44C54683.70707@sendu.me.uk>
Message-ID: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
On Jul 24, 2006, at 6:15 PM, Sendu Bala wrote:
> I'll tell you what. This will be easier if I just write the code
> for my
> proposals, including whatever changes would be needed in
> Bio::SeqIO::genbank et al.
Never get in the way of somebody who threatens to code :-) so I
certainly won't. I think you're on the right track.
My suggestion is, if you have a good picture in front of you of what
it's going to look like when done, just pretend for a second it is
done already and give us some code examples that use the new (to be
done) API.
As a start, some of the situations it's currently used in:
- genbank.pm parsing and setting species information for the sequence
- user asking for the scientific name of the species of the sequence
(obviously, the call would remain unchanged:
$seq->species->binomial(). But what happens behind the scenes?)
- genbank.pm writing the SOURCE information for a sequence
Replace genbank.pm with your rich annotation source parser of choice.
Then maybe some advanced uses:
- from a sequence stream, retain only those of primates
- like above, but only mitochondrial sequences
- for an organism, query entrez for all sequences of strains,
varieties, or subspecies sequences for that organism
Add your own if these sound stupid ...
Just an idea.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Mon Jul 24 22:06:16 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 21:06:16 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
<44C54683.70707@sendu.me.uk>
<86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
Message-ID: <4678548F-ABEC-4E14-AD7F-D282D2DC2730@uiuc.edu>
>
>> I'll tell you what. This will be easier if I just write the code
>> for my
>> proposals, including whatever changes would be needed in
>> Bio::SeqIO::genbank et al.
>
> Never get in the way of somebody who threatens to code :-) so I
> certainly won't. I think you're on the right track.
Fine by me. My only request: I don't want every sequence passing
through SeqIO having an automatic DB lookup performed on it. SeqIO
parsing of GenBank files is slow enough as it is w/o enforcing
lookups, even if they are cached.
If you want lookups, have it as an option and not as default
behavior. We could have the option for a lookup added pretty easily
in genbank.pm _initialize or the main SeqIO constructor as a simple
Boolean flag. That might be pretty nice.
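Roughly what I have in mind for the opt-in flag, as a sketch only: Sketch::Reader, the -taxonomy_lookup and -taxonomy_db argument names, and species_info() are all hypothetical and do not exist in SeqIO, and the "database" is just a code reference that counts its calls.

```perl
use strict;
use warnings;

# Hypothetical reader: lookups happen only when explicitly requested.
package Sketch::Reader;
sub new {
    my ($class, %args) = @_;
    return bless {
        lookup => $args{-taxonomy_lookup} ? 1 : 0,   # off unless requested
        db     => $args{-taxonomy_db},               # code ref: taxid -> lineage
    }, $class;
}
sub species_info {
    my ($self, $record) = @_;
    my @lineage = @{ $record->{lineage} || [] };     # as written in the file
    if ($self->{lookup} && $self->{db}) {
        @lineage = $self->{db}->($record->{taxid});  # only when opted in
    }
    return { scientific_name => $record->{organism}, lineage => \@lineage };
}

package main;

my $db_calls = 0;
my $fake_db  = sub { $db_calls++; return ('Eukaryota', 'Metazoa', 'Primates') };
my $record   = {
    organism => 'Homo sapiens',
    taxid    => 9606,
    lineage  => ['Eukaryota', 'Primates'],
};

my $fast = Sketch::Reader->new(-taxonomy_db => $fake_db);
my $full = Sketch::Reader->new(-taxonomy_db => $fake_db, -taxonomy_lookup => 1);

my $fast_info = $fast->species_info($record);   # no database call
my $full_info = $full->species_info($record);   # one database call
print "lookups performed: $db_calls\n";         # lookups performed: 1
```

With the flag left off, parsing cost is unchanged and the in-file lineage is kept verbatim.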
...
> (). But what happens behind the scene?)
> - genbank.pm writing the SOURCE information for a sequence
You know, the only really divisive point here is the lineage data and
how to store it in _read_GenBank_Species or reproduce it in
write_seq(). Again, I don't think we should have a forced lookup for
this; it should just be stored as is, either in Node or SimpleValue.
Again, I think the latter, as everyone seems averse to containing this
in Node.
> Then maybe some advanced uses:
>
> - from a sequence stream, retain only those of primates
> - like above, but only mitochondrial sequences
> - for an organism, query entrez for all sequences of strains,
> varieties, or subspecies sequences for that organism
For the primate example, would you screen those out via the in-file
lineage or using lookups?
Something like '$seqout->write_seq($seq) if ($seq->species->organelle
eq 'mitochondrion');' for the mitochondria example, which would mean
leaving organelle() in Species/Node or whatever is used.
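Both filters can be sketched without any lookups, using only what is in the file. The records below are plain hashes standing in for parsed sequence objects; the field names (id, lineage, organelle) and the helper subs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical records: plain hashes standing in for parsed sequences.
my @records = (
    { id => 'seq1', organelle => undef,
      lineage => [qw(Eukaryota Metazoa Mammalia Primates Hominidae)] },
    { id => 'seq2', organelle => 'mitochondrion',
      lineage => [qw(Eukaryota Metazoa Mammalia Primates Hominidae)] },
    { id => 'seq3', organelle => 'mitochondrion',
      lineage => [qw(Eukaryota Metazoa Mammalia Rodentia Muridae)] },
);

# True if the in-file lineage mentions the clade by name.
sub in_clade {
    my ($record, $clade) = @_;
    return scalar(grep { $_ eq $clade } @{ $record->{lineage} });
}

# True only when an organelle is recorded and it is the mitochondrion.
sub is_mitochondrial {
    my ($record) = @_;
    return (defined($record->{organelle})
            && $record->{organelle} eq 'mitochondrion') ? 1 : 0;
}

my @primates     = grep { in_clade($_, 'Primates') } @records;
my @primate_mito = grep { is_mitochondrial($_) } @primates;

print join(',', map { $_->{id} } @primates), "\n";        # seq1,seq2
print join(',', map { $_->{id} } @primate_mito), "\n";    # seq2
```

Matching on the lineage string is only as good as the file's lineage; a lookup-based filter would be the optional, slower alternative.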
The last one, I think, can be done w/o using the sequence directly
using NCBI's ELink and the TaxID to cross-reference the nucleotide
database. You would probably have to walk through all child nodes,
but it's feasible that way.
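The "walk through all child nodes" part could look something like this. The parent-to-children table and its IDs are made up for illustration, descendant_taxids() is a hypothetical helper, and the actual ELink query per TaxID is not shown.

```perl
use strict;
use warnings;

# Hypothetical parent -> children table; the IDs are made up.
my %children_of = (
    100 => [110, 120],
    110 => [111],
    120 => [],
    111 => [],
);

# Breadth-first walk collecting every descendant of $root.
sub descendant_taxids {
    my ($root, $children_of) = @_;
    my @found;
    my @queue = ($root);
    while (@queue) {
        my $taxid = shift @queue;
        for my $child (@{ $children_of->{$taxid} || [] }) {
            push @found, $child;
            push @queue, $child;
        }
    }
    return @found;
}

my @ids = descendant_taxids(100, \%children_of);
print join(',', @ids), "\n";    # 110,120,111
```

Each collected ID would then be used for one cross-reference query against the nucleotide database.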
> Add your own if these sound stupid ...
>
> Just an idea.
>
> -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Mon Jul 24 22:29:57 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 21:29:57 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C54683.70707@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
<44C54683.70707@sendu.me.uk>
Message-ID: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
Look, we're just going back and forth on this stupid little thing,
when the only point we really are divided on is which object type
should store certain items from a GenBank file (Bio::Species/
Bio::Tax::Node/Bio::Whatever). In particular, the main sticking
point is the lineage.
We could go back and forth on what Jason really intended.
Personally, I think his past statements are quite clear on what his
intent was (he's very clear in the wiki on what Bio::Taxonomy::Node
was built to replace, in two separate posts and within the last four
months). The reality is he's not here and you're willing to do the job.
There is one thing I will make perfectly clear here: there should
never, ever be enforced lookups for SeqIO (even using caches), though
I have no problem having optional ones. This is something I have
stated before and what you propose below steers dangerously in that
direction. Where, for instance, do you store the lineage from a
GenBank file? Do you want to do a series of Tax lookups to restore
that data? I think that the number one complaint for sequence
parsing is speed, which would only get slower with lookups (even
cached).
What I propose is we make it as simple as possible. Remove the
unnecessary genus/species/subspecies parsing in genbank.pm, store the
scientific name, common names, and lineage in some easily accessible
way to make it easier for everyday users to use, have it tied to
Bio::Taxonomy in some way (I propose Node, as it contains almost all
the methods needed) so that you could get more information by moving
up and down nodes, or retrieve more information. I, personally,
don't see the point in having Bio::Species around after this
discussion as Node seems to do the job adequately.
My last word (I will be exiting this discussion and the group for two
weeks):
This would have been MUCH easier if all three of us could have gone
to the local bar for a beer and discussed it. We should just take
the time out to videoconference next time.
Chris
> Chris Fields wrote:
>>
>> Also, I'm trying to follow the original idea as proposed by Jason
>> (this is
>> from perldoc Bio::Taxonomy::Node):
>>
>> Which, to me, indicated that this would eventually replace
>> Bio::Species
>
> Well, we don't really know that Jason didn't later change his mind,
> but
> in any case it doesn't make sense (anymore, given that we have
> Bio::Taxonomy).
>
> In a direct reply to me you point out specific passages in the current
> docs that explain why you have thought we should delegate or replace
> Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are
> not something we are forced to blindly follow. We decide for ourselves
> if they make sense, we decide for ourselves if there is a better
> way of
> doing it, and then we do it the best way.
>
> So if you ignore what those old bits of documentation say, just
> pretend
> you never ever read them, would my proposals make sense or not? Since
> those old proposals were never implemented we have no reason to try
> and
> stick with them if there is a better proposal.
>
> And for the record, '...Bio::Species which is able to represent only
> species-level' can (correctly) be interpreted as 'Bio::Species is only
> supposed to be used for representing a taxonomy that includes the
> species-level'. You can't interpret it literally because
> Bio::Species is
> used for levels below species, and also represents all the levels
> above
> species-level as well. Either Jason got it wrong when he wrote
> that, or
> you have misinterpreted it.
>
> Likewise, let's play the interpretation game again: 'Previously all
> information was managed by a single object called Bio::Species. [the
> Bio::Taxonomy::Node] implementation allows representation of the
> intermediate nodes not just the species nodes'. Note the apposition of
> 'single object' vs implication of multiple Node objects to do the same
> job. I imagine at the time Jason wrote that there was no
> Bio::Taxonomy,
> no holder for multiple Nodes.
>
>
>> I had originally wanted to start delegating everything over to
>> Taxonomy::Node about a month ago, when I found that it was
>> remarkably easy
>> to do so. However, when Sendu proposed making changes to remove
>> methods in
>> Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would
>> prevent an easy transition over to Node,
>
> But an equally easy transition to Bio::Taxonomy instead. I don't know
> why you would care about the name of the class we switch to. My
> concern
> is that when the switch is made it makes sense.
>
>
>> If we think it would be better to completely toss all this out the
>> window
>> and use only a bare-bones Node, then I'm fine with that. But if
>> we go that
>> route we should just get rid of the Bio::Species 'disease'
>> completely and
>> have things be much simpler. Simple is good!
>>
>> I think Node can still act as a viable container class for the tax
>> data from
>> a GenBank file (its original purpose) as long as it has the very
>> basic
>> methods for doing so. That would require:
>>
>> scientific_name() - ORGANISM line data
>> common_names() - which could hold common names (in parentheses on
>> the SOURCE
>> line) and the abbreviated name (from the SOURCE line)
>> ncbi_taxid() - from the 'source' seqfeature (already there).
>>
>> The lineage information and organelle information could be stored
>> in Node or
>> in SimpleValue objects. My vote is for the latter as there's no
>> need for a
>> classification() container for Node, which you have repeatedly
>> pointed out.
>
> No, this is the whole point. The lineage information can NOT be stored
> in a Node (unless you abuse Node by having all those crufty methods
> like genus() and classification()), and why would we store it in
> SimpleValue objects when we have Bio::Taxonomy?
>
> Bio::Taxonomy is completely perfect for storing the taxonomic
> information from a GenBank file. That's all you need to worry
> about. Can
> we represent the data correctly? Yes. Do we gain all the good things
> about a pure Bio::Taxonomy? Yes. Can we still do everything we used to
> be able to do? Yes.
>
>
>> I think we should just get rid of Bio::Species completely.
>
> There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy
> with backward-compatible methods. No harm done, all good.
>
>
> I'll tell you what. This will be easier if I just write the code
> for my
> proposals, including whatever changes would be needed in
> Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is,
> and hopefully everyone will be happy.
>
> Perhaps you could just hold off doing any similar-but-contradictory
> work
> until then.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From hlapp at gmx.net Mon Jul 24 23:31:41 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 23:31:41 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
<44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
Message-ID: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net>
On Jul 24, 2006, at 10:29 PM, Chris Fields wrote:
> [...]
> We could go back and forth on what Jason really intended. [...] The
> reality is he's not here and you're willing to do the job.
Right. And, knowing Jason, I think he'd be perfectly fine with seeing
his original idea develop in a possibly different direction, provided
it will all work nicely in the end. I'm willing to take the beating
on me if that doesn't turn out to be true ...
>
> There is one thing I will make perfectly clear here: there should
> never, ever be enforced lookups for SeqIO (even using caches),
You certainly don't want taxonomy lookups during the parsing stage,
and also not for the client requesting properties of the species that
have been parsed with high confidence, i.e., genus and species for a
straightforward binomial like 'Homo sapiens'.
Writing sequences, IMHO, doesn't have to be as fast. It may be better
to emit strict format a bit slower rather than sloppy format a bit
faster.
Upon parsing, one idea could be for the flat file parser to set a
dirty bit in the parsed out species if the parsed text didn't follow
strict binomial conventions, hence the parser may have made a mistake
and if a client requests the information it is better to look up the
correct values from a taxonomy database. I.e., you could try with a
strict regex first that would imply a high-confidence result. If that
fails you don't give up but mark the result as untrustworthy.
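The strict-then-loose idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: parse_organism() and the 'dirty' key are hypothetical names, not existing BioPerl API, and the strict pattern is deliberately narrow (one capitalised word followed by one lower-case word).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: parse an ORGANISM-style name and
# record whether the strict binomial pattern matched.
sub parse_organism {
    my ($text) = @_;

    # Strict pass: capitalised genus, lower-case species epithet, nothing else.
    if ($text =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { name => $text, genus => $1, species => $2, dirty => 0 };
    }

    # Loose pass: take the first two words as a best guess, but flag the
    # result so a client knows it may be worth checking against a database.
    my ($genus, $species) = split ' ', $text;
    return { name => $text, genus => $genus, species => $species, dirty => 1 };
}

for my $name ('Homo sapiens', 'Influenza A virus') {
    my $parsed = parse_organism($name);
    printf "%-20s genus=%-10s %s\n", $name, $parsed->{genus},
        $parsed->{dirty} ? 'dirty' : 'trusted';
}
```

No lookup happens at parse time in this sketch; the flag only records that the result is a guess.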
> [...]
> This would have been MUCH easier if all three of us could have gone
> to the local bar for a beer and discussed it. We should just take
> the time out to videoconference next time.
You're not honestly suggesting that a videoconference is better than
having beer together?
Enjoy your trip, and thanks for hanging in there in the discussion, I
appreciate it.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Tue Jul 25 01:53:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 00:53:33 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
<44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
<49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net>
Message-ID: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
So do we intend on having everyone who installs bioperl have a local
copy of the taxonomy dumpfile? Or perform a remote lookup via
Entrez? Seems a bit extreme.
I would like the option of not having the lookup run; as I mentioned
to Sendu, one of the biggest complaints about bioperl is speed.
Additional lookups won't help on that end.
Chris
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Tue Jul 25 03:05:23 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 25 Jul 2006 08:05:23 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
Message-ID: <44C5C2B3.1020304@sendu.me.uk>
Chris Fields wrote:
>
> There is one thing I will make perfectly clear here: there should
> never, ever be enforced lookups for SeqIO (even using caches), though
> I have no problem having optional ones. This is something I have
> stated before and what you propose below steers dangerously in that
> direction. Where, for instance, do you store the lineage from a
> GenBank file? Do you want to do a series of Tax lookups to restore
> that data? I think that the number one complaint for sequence
> parsing is speed, which would only get slower with lookups (even
> cached).
I already gave a code example of exactly how Bio::Taxonomy is perfect
for storing the lineage data in a GenBank file with or without a
database lookup. I think perhaps at the time you first read this you
basically ignored it because you had trouble with the idea of adding
nodes to a species. If you have been glossing over my argument, it may
be instructive to go over what I've been saying with a clear eye.
Anyway, here it is again, and remember in this example, Bio::Species isa
Bio::Taxonomy:
## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification =>
    ['Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');
So now do you see how we're able to do the Genbank no-db way and the
db-using way with the same object model? We're able to do it the same,
sane way because a Node is just a node; you can make them yourself
manually, or retrieve them from a database. Once you stick them in a
Taxonomy you can then (potentially) ask all the questions of the data
that you can with existing Bio::Species. No cruft is required anywhere
at all. All the Taxonomy classes can be 'pure', while only Bio::Species
has to have backward-compatibility methods.
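To make "the generate() method just does the fully-manual way for you" concrete, here is a minimal sketch of what such a method might do internally. It is not the proposed implementation: plain hashes stand in for Bio::Taxonomy::Node so it runs without BioPerl, nodes_from_classification() is a hypothetical name, and it assumes the list is ordered most-specific name first with only the first entry's rank known.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed generate(): builds one node per name
# and chains them together with object_id/parent_id, as in the manual example.
sub nodes_from_classification {
    my @names = @_;    # most specific name first
    my @nodes;
    for my $i (0 .. $#names) {
        push @nodes, {
            name      => $names[$i],
            object_id => $i + 1,
            # the last name in the list has no known parent
            parent_id => $i < $#names ? $i + 2 : undef,
            # only the first entry's rank is assumed; the rest stay undefined
            rank      => $i == 0 ? 'species' : undef,
        };
    }
    return @nodes;
}

my @nodes = nodes_from_classification(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae');

for my $n (@nodes) {
    printf "%d -> %s  %s (%s)\n",
        $n->{object_id},
        defined $n->{parent_id} ? $n->{parent_id} : 'none',
        $n->{name},
        defined $n->{rank} ? $n->{rank} : 'rank unknown';
}
```

The ids here are local to one lineage and mean nothing outside it, which is all a no-db GenBank parse could offer anyway.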
From bernd.web at gmail.com Tue Jul 25 06:47:50 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 12:47:50 +0200
Subject: [Bioperl-l] Structure::IO
Message-ID: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>
Hi,
Does someone have experience with Bio::Structure::IO?
The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the
chain() method of Bio::Structure::Entry doing? The POD states:
Title : chain
Usage : @chains = $structure->chain($chain);
Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
Returns : list of Bio::Structure::Residue objects
Args : One Residue or a reference to an array of Residue objects
But in e.g
my $stream = Bio::Structure::IO->new(-file => $filename,
                                     -format => 'pdb');
while ( my $struc = $stream->next_structure() ) {
    for my $chain ($struc->get_chains) {
        my $chainid = $chain->id;
        my @chains = $struc->chain($chain);
    }
}
I get Bio::Structure::Chain=HASH(0x9f1ab50).
What is the function of the chain method and how to use it?
Best regards,
bernd
From bernd.web at gmail.com Tue Jul 25 07:44:28 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 13:44:28 +0200
Subject: [Bioperl-l] SeqUtils
Message-ID: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>
Hi,
With Bio::SeqUtils it may be nice to support 3 letter codes with
capitals only, too.
Now
my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');
will give in $string->seq: XXX.
Possibly the capitals in MetGlyTer are used to find the amino acid codes?
If not, maybe it's easy to implement case-insensitive or all-capitals
matching for AA codes in SeqUtils?
In addition, about the POD: maybe it's better not to use $string, since
Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq
object.
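A minimal sketch of the case-insensitive matching suggested here, assuming the input is an unbroken run of three-letter codes. It does not use or modify Bio::SeqUtils: three_to_one() and the three-entry table are hypothetical and for illustration only. Unknown codes become 'X', as in the behaviour described above.

```perl
use strict;
use warnings;

# Small illustrative table only; a real one would cover every residue.
my %one_letter = (Met => 'M', Gly => 'G', Ter => '*');

# Hypothetical helper: convert a run of three-letter codes written in any case.
sub three_to_one {
    my ($codes) = @_;
    my $out = '';
    for my $code ($codes =~ /(...)/g) {
        my $key = ucfirst lc $code;    # 'MET' and 'met' both become 'Met'
        $out .= exists $one_letter{$key} ? $one_letter{$key} : 'X';
    }
    return $out;
}

print three_to_one($_), "\n" for qw(MetGlyTer METGLYTER metglyter);
```

Normalising each code with ucfirst(lc(...)) before the table lookup means the table itself can stay in the usual mixed-case form.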
Regards,
Bernd
From cjfields at uiuc.edu Tue Jul 25 08:28:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 07:28:01 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C5C2B3.1020304@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
<44C5C2B3.1020304@sendu.me.uk>
Message-ID: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu>
Look, you explaining this to me, as you see it, does not convince me
that it's the correct or right way to do it. Okay? Can we agree on
that? I do not think that Species and Taxonomy are the same thing.
A species should not hold more than one node. A species, by
definition, is a rank in Taxonomy, and is a node, not a full
Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't
see how I can be any clearer...
The fact that it may work is beside the point. That's like putting
duct tape on a leak to me. Why not just simplify Bio::Species into a
Node? Or make it into a Node and get rid of it altogether.
You are going to do what you want to do, regardless of what I say.
Seems to be par for the course here. I'm REALLY tired of arguing the
point. Okay? Just drop it. I have other priorities in life besides
goddamned bioperl right now...
Chris
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Tue Jul 25 08:52:03 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 25 Jul 2006 13:52:03 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
<44C5C2B3.1020304@sendu.me.uk>
<05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu>
Message-ID: <44C613F3.7070903@sendu.me.uk>
Chris Fields wrote:
> A species should not hold more than one node. A species, by
> definition, is a rank in Taxonomy, and is a node, not a full
> Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't
> see how I can be any clearer...
Right, we have differing viewpoints because you're concerned with what
Bio::Species /should/ be, based on the name of the file and perhaps its
original intent, whilst I am treating it as what it actually /is/, which
is an object that is used to contain information about multiple
taxonomic nodes.
> The fact that it may work is beside the point. That's like putting
> duct tape on a leak to me. Why not just simplify Bio::Species into a
> Node? Or make it into a Node and get rid of it altogether.
Bio::Species, again ignore the name, is just a thing that lets us store
and retrieve a certain set of data. If we simplified it into a pure
Node, it could no longer do that job. If we just get rid of it
altogether, it can no longer do its job.
By making it a Bio::Taxonomy it can continue to do its job without
having to have Node objects with cruft. It would also gain the useful
methods of Bio::Taxonomy at the same time.
I really don't mean to upset you, and I apologise for having done so.
I've been presenting what I thought was a logical argument in favour of
Bio::Species as Bio::Taxonomy, and waiting to see if anyone would come
up with a logical argument why that would be inappropriate, or why
something else would be better.
I'm not saying you're wrong and I'm certainly listening and would change
my choice based on what you have to say. I don't think it's fair to say
that disregarding what you have to say is 'par for the course' - I
already /have/ regarded what you had to say in this thread and ended up
doing scientific_name() as purely what we get from the database.
From hlapp at gmx.net Tue Jul 25 09:47:47 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 09:47:47 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C5C2B3.1020304@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
<44C5C2B3.1020304@sendu.me.uk>
Message-ID:
On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote:
> [...]
> ## the fully-manual way
> my $species = new Bio::Species;
> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>                                    -rank => 'species', -object_id => 1,
>                                    -parent_id => 2);
If this is meant as an example for the use cases I enumerated, then
you wouldn't have the parent_id from a Genbank file. However, you
didn't have that before either, so no problem.
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>                                  -object_id => 2, -parent_id => 3);
> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
I think in a confident parse you want to assign 'genus' if there's
little doubt, for example 'Saccharomyces cerevisiae'. Not sure
whether there are weird viruses whose names look innocuous but in
reality the name doesn't follow binomial convention.
> my $n3 = [etc]
> $species->add_node($node);
> $species->add_node($n2);
I know why you are doing this, but seeing this people will hit a
mental snag. You should listen to Chris' refusal to see the sense in
this as an indication that many people down the road won't see the
sense either.
So instead, make the logical model in your design more obvious, which
I think ultimately will help maintainability as well. For example:
my $taxonomy = Bio::Taxonomy->new();
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
$taxonomy->add_node($node);
$taxonomy->add_node($n2);
my $species = Bio::Species->new(-lineage => $taxonomy);
print $species->binomial();
print $species->genus();
# this may trigger a lookup if a taxonomy db handle has been set, e.g.:
# $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
print $species->classification();
> [etc]
>
> ## Using a factory without db access
> # assume that Bio::Taxonomy::GenbankFactory implements
> # some modified Bio::Taxonomy::FactoryI
> my $factory = Bio::Taxonomy::GenbankFactory->new();
> my $species = $factory->generate(-classification =>
>     ['Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ...]);
> # the generate() method above just does the fully-manual way for you
Except the method name would be create_object(), the parameter would
be a hash ref, and the return value would be a Bio::TaxonomyI
compliant object:
my $taxonomy = $factory->create_object({-classification =>
    ['Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ...]});
my $species = Bio::Species->new(-lineage => $taxonomy);
>
> ## Using a factory with db access
> # assume that Bio::Taxonomy::EntrezFactory implements some
> # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> # to get the nodes
> my $factory = Bio::Taxonomy::EntrezFactory->new();
The logic for where to do a lookup should not be duplicated here. It
belongs only under Bio::DB::Taxonomy::*.
> my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');
Likewise, use the methods defined in Bio::DB::Taxonomy, and again,
the return type is Bio::Taxonomy, which you would pass to
Bio::Species->new().
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Tue Jul 25 09:54:14 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 09:54:14 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
<44C54683.70707@sendu.me.uk>
<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
<49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net>
<190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
Message-ID: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net>
We intend on having everyone who wants correct taxonomy parsing
results for the entire kingdom of life to define his/her
authoritative taxonomy database, be it local or not, be it HTTP or
SQL queried.
If you don't care about the correctness of the taxonomy parse, or if
the taxonomy information in the flat file is trivially parseable
because it conforms to standard binomial convention, then whatever is
to be put in place needs to work fine regardless of whether a
taxonomy database is defined or not.
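One way to read this requirement is sketched below, under loud assumptions: Sketch::Species is a hypothetical class (not Bio::Species), the "database" is just a code ref over a made-up table, and the 'dirty' flag is the one floated earlier in the thread. The point is only that genus() gives an answer with or without a database, and consults one only when the parse was flagged.

```perl
use strict;
use warnings;

package Sketch::Species;    # hypothetical class, not Bio::Species

sub new {
    my ($class, %args) = @_;    # name, genus, dirty, and optionally db
    my $self = {%args};
    return bless $self, $class;
}

# Prefer the parsed genus; consult the database only when the parse was
# flagged as untrustworthy AND a database has actually been configured.
sub genus {
    my ($self) = @_;
    if ($self->{dirty} && $self->{db}) {
        my $found = $self->{db}->($self->{name});
        return $found if defined $found;
    }
    return $self->{genus};
}

package main;

# Stand-in for a taxonomy database: a code ref over a made-up lookup table.
my %genus_for = ('uncultured Saccharomyces' => 'Saccharomyces');
my $db = sub { return $genus_for{ $_[0] } };

my %parsed = (name  => 'uncultured Saccharomyces',
              genus => 'uncultured', dirty => 1);
my $without_db = Sketch::Species->new(%parsed);
my $with_db    = Sketch::Species->new(%parsed, db => $db);

print $without_db->genus, "\n";    # falls back to the loose parse
print $with_db->genus, "\n";       # corrected by the optional lookup
```

A trusted parse (dirty => 0) never touches the database in this sketch, which is the 'optional, not enforced' behaviour being asked for.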
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Tue Jul 25 10:58:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 09:58:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net>
Message-ID: <002601c6affa$ca4433f0$15327e82@pyrimidine>
Agreed. I fully support the addition of an optional lookup; it gives SeqIO
much more flexibility (re: your previous examples of screening sequence
streams for sequences that are primate, mitochondrial, etc.). The key word I
want to emphasize is 'optional', not 'enforced'.
I appreciate what Sendu is trying to do; I really do. I think carrying over
an object named 'Bio::Species' into Taxonomy is too confusing (your
'contagion' analogy, as it were). The 'species' concept (biologically
speaking here, not talking about the Bioperl class) is a taxonomic rank
(i.e. part of a taxonomy). I'm trying to take a biologist's point of view
here. What is a 'species'? Or, if we were to stick strictly with using
NCBI definitions, what is a 'species'?
The NCBI definition of 'species' is simply a rank in a lineage, so it is (in
Bioperl terms) a Node. If we were to follow that line of reasoning, why
also have a Species object represent a Taxonomy as well? It's way too
confusing.
Sendu's repeatedly stating "a Species is a Taxonomy" makes some sense in a
BioPerl world only, as we're speaking about a class that has been around for
a long time, one that acted as a container of sorts for sequence data. And
I understand what he intends to do.
Conceptually speaking here, though, the way it is laid out, a Bio::Species
object can hold a Node that represents a 'species' rank, as well as a
'genus' Node, and a 'family' node, and on and on. That's not a 'species',
that's a taxonomy. So just call it a Taxonomy.
The object itself (Bio::Species) never truly represented a 'species' anyway,
biologically speaking, every time it held sequence data. It could be a
subspecies, strain, plasmid, unknown, or an unclassified rank ('no rank') or
environmental sample. It really held a fancier representation of a node, as
based on the TaxID.
My final point is, saying "a species is a taxonomy" to the rest of the
biological world doesn't make sense. Maybe it makes sense to you and me and
Sendu, in our little Bioperl world. But to the thousands of users out there
who don't completely grok the Bioperl class structure, it's just confusing.
If I were to get an object back that was labeled Bio::Species, as a
biologist I would expect it to be part of a taxonomy, not the actual
Taxonomy itself. So, why not cut to the chase: if we are to fundamentally
change the concept of what Bio::Species is by making it a Taxonomy/TaxonomyI
or whatever, why not just use a Taxonomy object altogether and not bother
with Bio::Species at all? Deprecate it.
BTW, I'll be in Connecticut for five days at UConn. So I hope to escape the
heat for a bit. Thanks for listening to my side of things.
Chris
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Tuesday, July 25, 2006 8:54 AM
> To: Chris Fields
> Cc: Sendu Bala; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> We intend on having everyone who wants correct taxonomy parsing
> results for the entire kingdom of life to define his/her
> authoritative taxonomy database, be it local or not, be it HTTP or
> SQL queried.
>
> If you don't care about the correctness of the taxonomy parse, or if
> the taxonomy information in the flat file is trivially parseable
> because it conforms to standard binomial convention, then whatever is
> to be put in place needs to work fine regardless of whether a
> taxonomy database is defined or not.
>
> -hilmar
>
> On Jul 25, 2006, at 1:53 AM, Chris Fields wrote:
>
> > So do we intend on having everyone who installs bioperl have a local
> > copy of the taxonomy dumpfile? Or perform a remote lookup via
> > Entrez? Seems a bit extreme.
> >
> > I would like the option of not having the lookup run; as I mentioned
> > to Sendu, one of the biggest complaints about bioperl is speed.
> > Additional lookups won't help on that end.
> >
> > Chris
> >
> > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote:
> >
> >>
> >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote:
> >>
> >>> [...]
> >>> We could go back and forth on what Jason really intended. [...] The
> >>> reality is he's not here and you're willing to do the job.
> >>
> >> Right. And, knowing Jason, I think he'd be perfectly fine with seeing
> >> his original idea develop in a possibly different direction, provided
> >> it will all work nicely in the end. I'm willing to take the beating
> >> on me if that doesn't turn out to be true ...
> >>
> >>>
> >>> There is one thing I will make perfectly clear here: there should
> >>> never, ever be enforced lookups for SeqIO (even using caches),
> >>
> >> You certainly don't want taxonomy lookups during the parsing stage,
> >> and also not for the client requesting properties of the species that
> >> have been parsed with high confidence, i.e., genus and species for a
> >> straightforward binomial like 'Homo sapiens'.
> >>
> >> Writing sequences, IMHO, doesn't have to be as fast. It may be better
> >> to emit strict format a bit slower rather than sloppy format a bit
> >> faster.
> >>
> >> Upon parsing, one idea could be for the flat file parser to set a
> >> dirty bit in the parsed out species if the parsed text didn't follow
> >> strict binomial conventions, hence the parser may have made a mistake
> >> and if a client requests the information it is better to lookup the
> >> correct values from a taxonomy database. I.e., you could try with a
> >> strict regex first that would imply a high-confidence result. If that
> >> fails you don't give up but mark the result as untrustworthy.
> >>
> >>
> >>> [...]
> >>> This would have been MUCH easier if all three of us could have gone
> >>> to the local bar for a beer and discussed it. We should just take
> >>> the time out to videoconference next time.
> >>
> >> You're not honestly suggesting that a videoconference is better than
> >> having beer together?
> >>
> >> Enjoy your trip, and thanks for hanging in there in the discussion, I
> >> appreciate it.
> >>
> >> -hilmar
> >> --
> >> ===========================================================
> >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign
> >
> >
> >
> >
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
From cjfields at uiuc.edu Tue Jul 25 11:36:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 10:36:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
Message-ID: <003301c6b000$203cc560$15327e82@pyrimidine>
> On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote:
>
> > [...]
> > ## the fully-manual way
> > my $species = new Bio::Species;
> > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
> >                                    -rank => 'species', -object_id => 1,
> >                                    -parent_id => 2);
>
> If this is meant as an example for the use cases I enumerated, then
> you wouldn't have the parent_id from a Genbank file. However, you
> didn't have that before either, so no problem.
>
> > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> > -object_id => 2, -parent_id => 3);
> > # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
>
> I think in a confident parse you want to assign 'genus' if there's
> little doubt, for example 'Saccharomyces cerevisiae'. Not sure
> whether there are weird viri whose names look innocuous but in
> reality the name doesn't follow binomial convention.
>
> > my $n3 = [etc]
> > $species->add_node($node);
> > $species->add_node($n2);
>
> I know why you are doing this, but seeing this people will hit a
> mental snag. You should listen to Chris' refusal to see the sense in
> this as an indication that many people down the road won't see the
> sense either.
Thanks for pointing that out. I think there is only a small, fundamental
difference in our views here. I'm trying to view this as an outsider would,
a biologist not familiar with the Bioperl class structure. I understand
what Sendu's trying to accomplish but it's really confusing to someone not
familiar with what Bio::Species is.
Hilmar, you had pointed out several times that Bio::Species and
Bio::Taxonomy shouldn't directly intermingle.
My original thought for genbank.pm _read_GenBank_Species() was this, copied
and pasted from my local genbank.pm. It's sort of extreme, but it passes
tests just fine.
sub _read_GenBank_Species {
    my ($self, $buffer) = @_;
    $_ = $$buffer;

    my @organelles = qw(plastid chloroplast mitochondrion);
    my ($source_data, $common_name, @class, $ns_name, $organelle,
        $source_flag, $sci_name, $abbr);
    while (defined($_) || defined($_ = $self->_readline())) {
        # de-HTMLify (links that may be encountered here don't contain
        # escaped '>', so a simple-minded approach suffices)
        s/<[^>]+>//g;
        if ( /^SOURCE\s+(.*)/o ) {
            $source_data = $1;
            $source_data =~ s/\.$//; # remove trailing dot
            # does it have a GenBank common name in parentheses?
            # (list context, so the captured text is assigned rather
            # than the success flag of the match)
            ($common_name) = $source_data =~ m{\((.*)\)}xms;
            # organelle? If we find additional odd ones,
            # add to @organelles
            ($organelle) = grep { index($source_data, $_) >= 0 } @organelles;
            $source_flag = 1;
        } elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
            $sci_name = $1;
            $source_flag = 0;
        } elsif ($source_flag) { # no ORGANISM
            $common_name .= $source_data;
            $common_name =~ s/\n//g;
            $common_name =~ s/\s+/ /g;
            $source_flag = 0;
        } elsif ( /^\s+(.+)/o ) { # lineage information
            my $line = $1;
            # only split on ';' or '.' so that classification
            # that is 2 words will still get matched, use
            # map() to remove trailing/leading spaces
            push(@class, map { s/^\s+//; s/\s+$//; $_; }
                         split /[;\.]+/, $line)
                if ( $line =~ /(;|\.)/ );
        } else { # reach end of GenBank tax info
            last;
        }
        $_ = undef; # Empty $_ to trigger read of next line
    }
    $$buffer = $_;
    @class = reverse @class;

    my $make = Bio::Taxonomy::Node->new();
    $make->common_name( $common_name ) if $common_name;
    $make->scientific_name($sci_name) if $sci_name;
    # could use SimpleValue objs here instead
    $make->classification( @class ) if @class;
    $make->organelle($organelle) if $organelle;
    return $make;
}
# back in next_seq...grab the TaxID from the 'source' seqfeature
# could check organelle() here as well
# add taxon_id from source if available
if ($species && ($feat->primary_tag eq 'source') &&
    $feat->has_tag('db_xref') && (! $species->ncbi_taxid())) {
    foreach my $tagval ($feat->get_tag_values('db_xref')) {
        if (index($tagval, "taxon:") == 0) {
            $species->ncbi_taxid(substr($tagval, 6));
            last;
        }
    }
}
In other words, remove the extra parsing of genus(), species(), subspecies(),
etc. All GenBank sequences have a node represented in NCBI's tax database
(I checked it out). Even plasmids, unknowns, environmental samples.
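
The db_xref scan above can be tried outside BioPerl with a small stand-alone
sketch. The helper name taxid_from_db_xrefs is made up for illustration, and
it takes a plain list of db_xref strings instead of a seqfeature object, so
treat it as a demonstration of the string test only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the NCBI TaxID from a
# list of db_xref values such as 'taxon:4932', or undef if none is present.
sub taxid_from_db_xrefs {
    my @xrefs = @_;
    foreach my $tagval (@xrefs) {
        # same test as in the next_seq() fragment: value starts with 'taxon:'
        if (index($tagval, 'taxon:') == 0) {
            return substr($tagval, 6);
        }
    }
    return;
}

my $taxid = taxid_from_db_xrefs('GeneID:850000', 'taxon:4932');
print defined $taxid ? "TaxID: $taxid\n" : "no TaxID\n";
```

Because only the first 'taxon:' value is used, a record carrying several
taxon cross-references would need a different policy.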
Chris
> So instead, make the logical model in your design more obvious, which
> I think ultimately will help maintainability as well. For example:
>
> my $taxonomy = Bio::Taxonomy->new();
> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>                                    -rank => 'species', -object_id => 1,
>                                    -parent_id => 2);
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> -object_id => 2, -parent_id => 3);
> $taxonomy->add_node($node);
> $taxonomy->add_node($n2);
>
> my $species = Bio::Species->new(-lineage => $taxonomy);
> print $species->binomial();
> print $species->genus();
> # this may trigger a lookup if a taxonomy db handle has been set, e.g.:
> # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
> print $species->classification();
>
>
> > [etc]
> >
> > ## Using a factory without db access
> > # assume that Bio::Taxonomy::GenbankFactory implements
> > # some modified Bio::Taxonomy::FactoryI
> > my $factory = Bio::Taxonomy::GenbankFactory->new();
> > my $species = $factory->generate(-classification =>
> >     ['Saccharomyces cerevisiae', 'Saccharomyces',
> >      'Saccharomycetaceae' ...]);
> > # the generate() method above just does the fully-manual way for you
>
> Except the method name would be create_object(), the parameter would
> be a hash ref, and the return value would be a Bio::TaxonomyI
> compliant object:
>
> my $taxonomy = $factory->create_object({-classification =>
>     ['Saccharomyces cerevisiae', 'Saccharomyces',
>      'Saccharomycetaceae' ...]});
> my $species = Bio::Species->new(-lineage => $taxonomy);
>
>
> >
> > ## Using a factory with db access
> > # assume that Bio::Taxonomy::EntrezFactory implements some
> > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> > # to get the nodes
> > my $factory = Bio::Taxonomy::EntrezFactory->new();
>
> The logic where to do a lookup on should not be duplicated here. It
> only belongs under Bio::DB::Taxonomy::*.
>
> > my $species = $factory->fetch(-scientific_name =>
> >     'Saccharomyces cerevisiae');
>
> Likewise, use the methods defined in Bio::DB::Taxonomy, and again,
> the return type is Bio::Taxonomy, which you would pass to
> Bio::Species->new().
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
From bix at sendu.me.uk Tue Jul 25 13:49:04 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 25 Jul 2006 18:49:04 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003301c6b000$203cc560$15327e82@pyrimidine>
References: <003301c6b000$203cc560$15327e82@pyrimidine>
Message-ID: <44C65990.4080500@sendu.me.uk>
Chris Fields wrote:
> If I were to get an object back that was labeled Bio::Species, as a
> biologist I would expect it to be part of a taxonomy, not the actual
> Taxonomy itself.
I think this is the most important sentence in the discussion. Ok, so
it's clear to me that a better solution is needed than my
Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
also needed to start trying to code my Taxonomy proposal to see some
issues with it.
[... in another email...]
> I'm trying to view this as an outsider would,
> a biologist not familiar with the Bioperl class structure.
Ok, let's come up with a proposal that makes sense to the biologist and
better matches Jason's original idea.
---- long post follows; there's a summary at the end
As a biologist when I consider a species I have the following primary
questions. Let's see how we would answer them using a) Bio::Species and
genbank.pm as they are now, b) Bio::Species if it was a 'pure'
Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species
and used Node directly), and Chris' updated genbank.pm. Let's say we got
our species information from a genbank file where the scientific name
and tax id are available to be parsed out.
# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()
# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit
rubbish though, I can't tell what any of the array elements are supposed
to be.
b) A pure Node wouldn't store the lineage on itself. There are two
obvious solutions: 1) add cruft to Node by giving it a classification()
method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has
the benefit of telling me what rank each ancestor was, if that
information had been in the file (more likely, if Node was generated
from database). Problem: get_Lineage_Nodes() only works if it can
$self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage didn't come
from a database, but from the parsing of a genbank flat file. As we
parse the genbank file we can certainly make nodes for each word in the
list:
# inside genbank.pm...
@class = reverse @class;
my @nodes;
my $fake_id = 1;
foreach my $sci_name (@class) {
    # each node's parent is the next (more general) name in the list
    push(@nodes, new Bio::Taxonomy::Node(-name      => $sci_name,
                                         -object_id => $fake_id,
                                         -parent_id => $fake_id + 1));
    $fake_id++;
}
But how do we keep these nodes and make them returnable later by
get_Lineage_Nodes? Perhaps:
my $taxonomy = new Bio::Taxonomy;
foreach my $node (@nodes) {
$taxonomy->add_node($node);
}
...
my $make = Bio::Taxonomy::Node->new();
...
$make->db_handle($taxonomy);
Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node
which only accepts a rank). Of course this is ugly, storing a Taxonomy
in our database handle. We could have a new Bio::DB::Taxonomy:: class
instead, that treated a classification array like a database? It could
have the added bonus of building up an entire database internally as
more input arrays are given to it, able to therefore give each node a
unique but consistent id. It would break if one time you gave it qw(Homo
Primates) and another time qw(Homo Hominidae Primates), however. Ideas?
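
A rough sketch of the "classification array as database" idea, using only
core Perl. ClassificationDB, add_lineage() and lineage_names() are invented
names for illustration, not existing Bio::DB::Taxonomy methods, and plain
hashes stand in for Node objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Bio::DB::Taxonomy-style class fed from
# classification arrays; nothing here is an existing BioPerl API.
package ClassificationDB;

sub new {
    my ($class) = @_;
    return bless { next_id => 1, id_of => {}, node => {} }, $class;
}

# Takes a lineage ordered from most specific to most general, e.g.
# ('Homo sapiens', 'Homo', 'Hominidae'), and returns the id of the first name.
sub add_lineage {
    my ($self, @names) = @_;
    my $parent_id;    # the most general name has no parent
    foreach my $name (reverse @names) {
        my $id = $self->{id_of}{$name};
        if (!defined $id) {
            # first time this name is seen: assign the next free id
            $id = $self->{next_id}++;
            $self->{id_of}{$name} = $id;
            $self->{node}{$id} = { name => $name, parent_id => $parent_id };
        }
        $parent_id = $id;
    }
    return $parent_id;
}

# Returns the names from the given id up to the root.
sub lineage_names {
    my ($self, $id) = @_;
    my @names;
    while (defined $id) {
        my $node = $self->{node}{$id};
        push @names, $node->{name};
        $id = $node->{parent_id};
    }
    return @names;
}

package main;

my $db    = ClassificationDB->new;
my $yeast = $db->add_lineage('Saccharomyces cerevisiae', 'Saccharomyces',
                             'Saccharomycetaceae');
my $other = $db->add_lineage('Saccharomyces bayanus', 'Saccharomyces',
                             'Saccharomycetaceae');
print join(' > ', $db->lineage_names($yeast)), "\n";
```

Ids stay consistent across inputs because they are keyed on the name alone.
That is also the weak point: a name keeps the parent it was first seen with,
so inconsistent lineages such as the Homo/Primates case are silently resolved
in favour of whichever array arrived first.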
# What if I don't want the whole lineage, just to know what a specific
rank like genus is for my species?
a) use genus(), but not guaranteed to be correct.
b) two solutions: 1) add cruft to Node by adding a genus() method: as
good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until
you find a node with your rank() of interest. Same problems as for
lineage question, but also it would be nicer to have a
get_node('rank_name') style method. But such a method belongs in
something like Bio::Taxonomy, not Node. At the very least a method like
genus() would be implemented using pure Node methods like
get_Parent_Node(), returning undefined if no parent had a rank() of
'genus', never guessing it.
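
A sketch of that "walk up until the rank matches, never guess" behaviour.
Plain hashes stand in for Node objects here, and name_at_rank() is a made-up
helper, not a method of Bio::Taxonomy::Node.

```perl
use strict;
use warnings;

# Plain hashes stand in for Bio::Taxonomy::Node objects; the keys of
# %node_by_id play the role of object_id(). Ids and ranks are illustrative.
my %node_by_id = (
    1 => { name => 'Saccharomyces cerevisiae', rank => 'species', parent_id => 2 },
    2 => { name => 'Saccharomyces',            rank => 'genus',   parent_id => 3 },
    3 => { name => 'Saccharomycetaceae',       rank => undef,     parent_id => undef },
);

# Hypothetical helper: walk from a node towards the root and return the name
# of the first node with the requested rank, or undef if none has it.
sub name_at_rank {
    my ($nodes, $id, $rank) = @_;
    while (defined $id) {
        my $node = $nodes->{$id} or last;    # unknown id: stop walking
        return $node->{name}
            if defined $node->{rank} && $node->{rank} eq $rank;
        $id = $node->{parent_id};
    }
    return;
}

my $genus = name_at_rank(\%node_by_id, 1, 'genus');
print defined $genus ? "$genus\n" : "genus unknown\n";
```

The third node has no rank recorded, so asking for 'family' returns undef
instead of assuming that the name two steps up must be the family.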
# Is this species the same as another species?
a) Not guaranteed to be correct. (no unique id so forced to compare names)
b) Correct answer by using object_id() method, along with Chris' change
to genbank.pm.
# What is the most recent common ancestor of this species and another?
a) Can't be answered.
b) Use get_LCA_Node(), but same issues as the lineage question, since
get_LCA_Node requires a working get_Lineage_Nodes(). It also requires
correct (unique) ids for all nodes in all lineages to give the
guaranteed correct answer. But at least you /might/ get the correct
answer even using only the data in genbank files and no db lookup.
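
For illustration, the "might get the correct answer" case without any db
lookup amounts to comparing two parsed lineages by name. The helper
common_ancestor_name() below is hypothetical and is not how get_LCA_Node()
is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: given two lineages as array refs ordered from most
# specific to most general, return the most specific name they share,
# or undef if they share none.
sub common_ancestor_name {
    my ($lineage_a, $lineage_b) = @_;
    my %in_b = map { $_ => 1 } @$lineage_b;
    foreach my $name (@$lineage_a) {
        return $name if $in_b{$name};
    }
    return;
}

my @human = ('Homo sapiens',    'Homo', 'Hominidae', 'Primates');
my @chimp = ('Pan troglodytes', 'Pan',  'Hominidae', 'Primates');
my $lca = common_ancestor_name(\@human, \@chimp);
print defined $lca ? "$lca\n" : "no common ancestor found\n";
```

Matching on names is exactly why the result is not guaranteed: two different
taxa that happen to share a name would be treated as the same node, which
unique ids from a real taxonomy database would avoid.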
---- summary:
It seems like the main problem with Node right now is that it has
classification() and things like genus(). I propose pure Node method
solutions to answer the questions classification() and genus() were
implemented to answer, but in a better, cruft-free way.
Bio::DB::Taxonomy::genbank anyone?
Then if you started with a Species/Node generated by a genbank parse,
and wanted certain questions answered correctly, you only have to set a
different db_handle(). The Node only stores the static and hopefully
correct information about itself, whilst all other questions go via
db_handle, so you can dynamically swap back and forth between databases
depending on if you need speed or accuracy.
From cjfields at uiuc.edu Tue Jul 25 14:24:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 13:24:12 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C65990.4080500@sendu.me.uk>
Message-ID: <000001c6b017$873176a0$15327e82@pyrimidine>
Sendu, you'll have to make the changes how you see fit. You see my point
now, which is great.
From my perspective, all that the object type (the one used to hold the
taxonomy information from the file) needs to contain is the scientific name
and the common names: the abbreviated name from the SOURCE line and the
actual GenBank common name, if present. All the other cruft (i.e.
genus/species/subspecies) can be excised, and the proper taxonomic
information, if wanted, could be accessed via the object and its TaxID.
Organelle and lineage information needs to be retained (for the
non-taxonomists) and could be stored in that object, bumped to SimpleValue
objects, or (since the data is small) just set using a get/set value within
the sequence object itself. This would be the bare-bones approach, which
Node can fulfill.
I also like Hilmar's proposal about including optional lookups, which
greatly increases the flexibility when screening sequences. This will
likely require a more complicated object structure (i.e. taxonomy with
nodes). You suggested a Taxonomy-like object which would work; but don't
force Bio::Species into the mix. Why not just use a simple Bio::Taxonomy
object for that (Hilmar's point)?
When one asks for $species->species, they'll get a Node or Taxonomy,
whichever is used (that's up to you). The Node represents a more-barebones
variation, while the Taxonomy object scheme would be more fully-realized.
Either way will work for me. Just don't call it 'species'. ; >
Once this is all done, will we really have a need for Bio::Species? That's
my other point. The only real use for it was as a container object for
sequence data. That job is now done via a Taxonomy/Node object. The only
real use it would have is as a container for taxonomic information for
species ranks or below. I think Node/Taxonomy can handle even that though,
so now it's also redundant. If a class is not useful and is redundant,
maybe it should be deprecated.
Anyway, I can't get involved anymore at this point; I'm too busy with
getting ready for the Kadner Institute next week. Good luck!
Chris
From hlapp at gmx.net Tue Jul 25 15:18:00 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 15:18:00 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000001c6b017$873176a0$15327e82@pyrimidine>
References: <000001c6b017$873176a0$15327e82@pyrimidine>
Message-ID: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
On Jul 25, 2006, at 2:24 PM, Chris Fields wrote:
> Once this is all done, will we really have a need for Bio::Species?
No, except for backwards compatibility. Phasing it out will go over a
couple of releases. E.g., v1.6.x could have a deprecation warning in
the documentation. v1.7+ would have deprecation warnings in the code
written to stderr.
Just as an aside, we can't just drastically change the return type of
a method. Instead, if at all possible, there should be a new method
so that the old can be phased out over time but otherwise not
changed. I.e., don't change $seq->species() to now all of a sudden
return a node or taxonomic lineage, even if initially Bio::Species is
returned with some magic under the hood. Instead, create something like
# return a Bio::Taxonomy::Node:
my $taxon = $seq->taxon();
# alternative approach: return a lineage (taxonomy)
# this would be Bio::TaxonomyI compliant
my $lineage = $seq->lineage();
The former would require the lineage (and organelle for completeness)
information to be either easily (though not necessarily directly)
accessible through the node, or added as annotation.
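
A minimal sketch of that phase-out pattern, assuming a made-up My::Seq class
(not Bio::Seq): the old accessor keeps its return value and only gains a
warning on stderr, while the new accessor lives next to it.

```perl
use strict;
use warnings;

package My::Seq;    # hypothetical class for illustration, not Bio::Seq

use Carp qw(carp);

sub new {
    my ($class, %args) = @_;
    return bless { species => $args{species}, taxon => $args{taxon} }, $class;
}

# New accessor: returns the new-style taxon object.
sub taxon {
    my ($self) = @_;
    return $self->{taxon};
}

# Old accessor: return type unchanged, but callers are told to move on.
sub species {
    my ($self) = @_;
    carp 'species() is deprecated; use taxon() instead';
    return $self->{species};
}

package main;

my $seq = My::Seq->new(
    species => 'legacy species object',
    taxon   => { scientific_name => 'Homo sapiens' },
);
print $seq->taxon->{scientific_name}, "\n";
```

In the documentation-only phase the carp line would simply be left out;
adding it later does not change what either method returns.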
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Tue Jul 25 15:30:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 14:30:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
Message-ID: <000101c6b020$d09bc7b0$15327e82@pyrimidine>
Sounds good to me. I'm fine with any way that it's worked out, either
Taxonomy or Node-based, as long as there is no Bio::Species-based confusion re:
Taxonomy, and that this eventually leads to getting rid of Bio::Species
altogether.
Have fun, guys!
(hey, probably the shortest response I have written)...
Chris
From cjfields at uiuc.edu Tue Jul 25 22:16:36 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 21:16:36 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C65990.4080500@sendu.me.uk>
References: <003301c6b000$203cc560$15327e82@pyrimidine>
<44C65990.4080500@sendu.me.uk>
Message-ID: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
One last thing before I shut off bioperl for a week and concentrate
on Connecticut;
On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> If I were to get an object back that was labeled Bio::Species, as a
>> biologist I would expect it to be part of a taxonomy, not the actual
>> Taxonomy itself.
>
> I think this is the most important sentence in the discussion. Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the
> uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.
... Again, thanks for noticing that.
> ---- summary:
>
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
>
> Bio::DB::Taxonomy::genbank anyone?
Ach... You're compromising here; that's not like you. I think
you're making this too complicated by trying too many things at
once. Don't think sudden dramatic changes in the API. Sneak changes
in in a way that doesn't scare users away, then let them get used to
the new way of grabbing Tax data. Make your point that it's more
accurate to do it this way (you'll have defenders in Hilmar and me, BTW).
Do this (start with genbank.pm):
1) Switch out Bio::Species with Node or Taxonomy; relocate other
information temporarily (Bio::Species, get/sets in Seq object,
SimpleValue). Leave Bio::Species in for the time being, but don't
bother making any additional changes to it.
2) Make sure next_seq() and write_seq() work and pass tests. Add
additional tests for the Tax/Node object (you could even use the tax
dump data you recently added for more complicated tests).
3) Add in additional stuff bit by bit until it is where you would
like it.
4) Make sure parsing is kosher with the latest release notes.
Probably should make sure write_seq follows what the release notes
state to some degree.
And, really, you won't break anything with genbank.pm organelle()
parsing. If you look at the module the organelle isn't even touched
in next_seq() or _read_GenBank_Species(), so it was broken to begin
with!
My proposal, though extreme, was to remove genus() etc (which you
wanted as well with Node). You could leave this cruft for the time
being in Bio::Species, which could still act as a sequence tax info
holder object. It just won't be the >default< Seq tax information
object, which would be Bio::Taxonomy or Node.
Hence Hilmar's suggestion to use a $seq->taxon() method to return a
Node/Taxonomy, and a $seq->species() would still return a
Bio::Species object. It's redundant, but only for the time being,
and the redundant information wouldn't have a major memory footprint
anyway (not like the feature table or the full sequence might). Any
information that isn't stored in whatever Tax object you use (i.e.
lineage or organelle) could be stored temporarily in another fashion,
such as a get/set in Seq or SimpleValue object, to make next_seq/
write_seq work (such as $seq->organelle() or $seq->classification(),
instead of $seq->species->organelle and so on).
Hilmar then suggests, around 1.6-ish release, note the changes made
to SeqIO towards Bio::Taxonomy-based objects, and indicate that
Bio::Species via species() and its associated methods will be
deprecated around 1.7 (gives everybody notice on API issues). Then
add warnings to Bio::Species in 1.7 noting the deprecation, then
remove from core completely in 1.8 - 2.0.
One last thing, which is minor really: I remember seeing something
about having Nodes with 'no rank' ignored unless a flag is used.
That may be bad news for some organisms in sequence files where the
TaxID is for a 'no rank' rank, such as environmental samples. May
want to think about that here.
I'm hoping the releases will start popping out a bit more
periodically than they have been. There have been volunteers to
release periodic updates for bug fixes etc.
If I get a chance I'll try keeping up. Don't count on it though.
The conference is 7am-9pm most days, for five days straight!
Chris
>
> Then if you started with a Species/Node generated by a genbank parse,
> and wanted certain questions answered correctly, you only have to
> set a
> different db_handle(). The Node only stores the static and hopefully
> correct information about itself, whilst all other questions go via
> db_handle, so you can dynamically swap back and forth between
> databases
> depending on if you need speed or accuracy.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From vrramnar at student.cs.uwaterloo.ca Tue Jul 25 22:44:17 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Tue, 25 Jul 2006 22:44:17 -0400
Subject: [Bioperl-l] SNP reference file download
In-Reply-To: <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu>
References: <000001c6b01f$bfd54e20$15327e82@pyrimidine>
<1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca>
<775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu>
Message-ID: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca>
Hey Chris,
I believe I updated all those modules already as I downloaded the entire DB.tar
from Bioperl live. Here is my code:
#!/usr/bin/perl -w
use Bio::Perl;
use Bio::DB::EUtilities;
my @ids = qw(rs4986950);
# With the "rs" before the number the warning says: "no returned links"
# Without the "rs" before the number the warning says:
# "No databases returned; empty linkset"
my $elink = Bio::DB::EUtilities->new( -eutil  => 'elink',
                                      -id     => \@ids,
                                      -db     => 'omim',
                                      -dbfrom => 'snp');
$elink->get_response;
print "IDs: ", join q(,), $elink->get_ids;
Which gives the following error:
-------------------- WARNING ---------------------
MSG: No databases returned; empty linkset
---------------------------------------------------
------------- EXCEPTION -------------
MSG: Must use database to access IDs
STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/Perl/5.8.6/Bio/DB/EUtilities/ElinkData.pm:201
STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/EUtilities.pm:482
STACK toplevel getOmimNum:13
--------------------------------------
All I really want is the OMIM id number under the section: NCBI Resource Links
from the page:
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562
Any idea why this still isn't working??
Rohan
Quoting Chris Fields :
> Odd, I thought XML::Simple was part of the 5.8 core. Guess I was
> wrong. I plan on changing this to a more robust parser soon (likely
> XML::SAX or XML::Twig, which will also require a download).
>
> That warning occurs if you don't have a link to OMIM present (No
> databases returned; empty linkset). The way Elink works is it stores
> internal data in a separate object (ELinkData) contained in an
> internal cache. The method get_ids() works for all EUtilities to
> retrieve IDs, even from ELink objects. The unique problem with ELink
> is that, since you can search multiple databases, you can retrieve
> multiple sets of IDs.
>
> If you haven't done it, update your EUtilities; the problem is
> similar to one I fixed today (I stated something about updating in my
> last post). Also, update the main Bio::DB::EUtilities and
> Bio::GenericWebDBI as well (the last is the base class from which
> EUtilities is derived). The 'Count:1' was a debugging statement I
> forgot to remove a while ago which I changed in CVS yesterday. It's
> possible that commit had other changes which I forgot about.
>
> Sorry about that, but it is still experimental (emphasis on the
> 'mental').
>
> Chris
>
> On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote:
>
> >
> > Hey Chris,
> >
> > Ignore the last email, I fixed that problem and downloaded/
> > installed the
> > required XML modules.
> >
> > However, I am now getting this error message:
> >
> > -------------------- WARNING ---------------------
> > MSG: No databases returned; empty linkset
> > ---------------------------------------------------
> > Count: 1
> >
> > ------------- EXCEPTION -------------
> > MSG: Must use database to access IDs
> > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/
> > Perl/5.8.6/Bio/
> > DB/EUtilities/ElinkData.pm:201
> > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/
> > EUtilities.pm:483
> > STACK toplevel getOmimNum:15
> >
> > --------------------------------------
> >
> > What does this mean??
> >
> > Rohan
> >
> > Quoting Chris Fields :
> >
> >> Okay, had to fix an odd bug from ELink due to the way NCBI returns
> >> data.
> >>
> >> You'll need to update the EUtilities modules in bioperl from CVS
> >> to make
> >> sure this works.
> >>
> >> This is how it's done:
>
----------------------------------------
This mail sent through www.mywaterloo.ca
From cjfields at uiuc.edu Wed Jul 26 01:01:41 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 26 Jul 2006 00:01:41 -0500
Subject: [Bioperl-l] SNP reference file download
In-Reply-To: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca>
References: <000001c6b01f$bfd54e20$15327e82@pyrimidine>
<1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca>
<775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu>
<1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca>
Message-ID:
The ID below doesn't have any OMIM-linked data, hence the warning.
The problem is that NCBI, when it doesn't find a link, doesn't send
anything constructive to tell you that. It sends the original ID
encoded in XML, but no actual DBs or ID data links. That's what the
warning means. I'll make the original warning a bit more direct: No
databases returned; no IDs found.
The thrown error is from a logic problem; I have fixed it and
committed to CVS.
Here's the web page: no OMIM data there either...
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=4986950
Try changing your ID list to this:
my @ids = qw(4986950 1800562);
You should get back only one ID (only one has an OMIM number).
By the way, the SNP data ID is only the digits (don't include the
'rs' designation).
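If the IDs come from somewhere that includes the prefix, something like the following could strip it first. This is only a sketch in plain Perl: normalize_snp_ids is a hypothetical helper, not part of Bio::DB::EUtilities.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce dbSNP identifiers such as 'rs4986950'
# to the bare digits, skipping anything that does not look like an ID.
sub normalize_snp_ids {
    my @raw = @_;
    my @ids;
    for my $id (@raw) {
        if ($id =~ /^(?:rs)?(\d+)$/i) {
            push @ids, $1;
        }
        else {
            warn "Skipping unrecognised SNP ID '$id'\n";
        }
    }
    return @ids;
}

my @ids = normalize_snp_ids(qw(rs4986950 1800562));
print join(',', @ids), "\n";    # prints 4986950,1800562
```

The resulting @ids can then be passed by reference as the -id argument, as in the script quoted below.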
Chris
On Jul 25, 2006, at 9:44 PM, vrramnar at student.cs.uwaterloo.ca wrote:
>
> Hey Chris,
>
> I believe I updated all those modules already as I downloaded the
> entire DB.tar
> from Bioperl live. Here is my code:
>
> #!/usr/bin/perl -w
>
> use Bio::Perl;
> use Bio::DB::EUtilities;
>
> my @ids = qw(rs4986950);
> # With the "rs" before the number the warning says: "no returned
> links"
> # Without the "rs" before the number the warning says: "No
> databases returned;
> empty linkset"
>
>
> my $elink = Bio::DB::EUtilities->new( -eutil => 'elink',
> -id => \@ids,
> -db => 'omim',
> -dbfrom => 'snp');
> $elink->get_response;
> print "IDs: ", join q(,), $elink->get_ids;
>
> Which gives the following error:
>
> -------------------- WARNING ---------------------
> MSG: No databases returned; empty linkset
> ---------------------------------------------------
>
> ------------- EXCEPTION -------------
> MSG: Must use database to access IDs
> STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/
> Perl/5.8.6/Bio/
> DB/EUtilities/ElinkData.pm:201
> STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/
> EUtilities.pm:482
> STACK toplevel getOmimNum:13
>
> --------------------------------------
>
> All I really want is the OMIM id number under the section: NCBI
> Resource Links
> from the page:
> http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562
>
> Any idea why this still isn't working??
>
> Rohan
>
>
> Quoting Chris Fields :
>
>> Odd, I thought XML::Simple was part of the 5.8 core. Guess I was
>> wrong. I plan on changing this to a more robust parser soon (likely
>> XML::SAX or XML::Twig, which will also require a download).
>>
>> That warning occurs if you don't have a link to OMIM present (No
>> databases returned; empty linkset). The way Elink works is it stores
>> internal data in a separate object (ELinkData) contained in an
>> internal cache. The method get_ids() works for all EUtilities to
>> retrieve IDs, even from ELink objects. The unique problem with ELink
>> is that, since you can search multiple databases, you can retrieve
>> multiple sets of IDs.
>>
>> If you haven't done it, update your EUtilities; the problem is
>> similar to one I fixed today (I stated something about updating in my
>> last post). Also, update the main Bio::DB::EUtilities and
>> Bio::GenericWebDBI as well (the last is the base class from which
>> EUtilities is derived). The 'Count:1' was a debugging statement I
>> forgot to remove a while ago which I changed in CVS yesterday. It's
>> possible that commit had other changes which I forgot about.
>>
>> Sorry about that, but it is still experimental (emphasis on the
>> 'mental').
>>
>> Chris
>>
>> On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote:
>>
>>>
>>> Hey Chris,
>>>
>>> Ignore the last email, I fixed that problem and downloaded/
>>> installed the
>>> required XML modules.
>>>
>>> However, I am now getting this error message:
>>>
>>> -------------------- WARNING ---------------------
>>> MSG: No databases returned; empty linkset
>>> ---------------------------------------------------
>>> Count: 1
>>>
>>> ------------- EXCEPTION -------------
>>> MSG: Must use database to access IDs
>>> STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/
>>> Perl/5.8.6/Bio/
>>> DB/EUtilities/ElinkData.pm:201
>>> STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/
>>> EUtilities.pm:483
>>> STACK toplevel getOmimNum:15
>>>
>>> --------------------------------------
>>>
>>> What does this mean??
>>>
>>> Rohan
>>>
>>> Quoting Chris Fields :
>>>
>>>> Okay, had to fix an odd bug from ELink due to the way NCBI returns
>>>> data.
>>>>
>>>> You'll need to update the EUtilities modules in bioperl from CVS
>>>> to make
>>>> sure this works.
>>>>
>>>> This is how it's done:
>>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Wed Jul 26 05:19:29 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 10:19:29 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk>
<1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
Message-ID: <44C733A1.9070201@sendu.me.uk>
Chris Fields wrote:
>
>> It seems like the main problem with Node right now is that it has
>> classification() and things like genus(). I propose pure Node method
>> solutions to answer the questions classification() and genus() were
>> implemented to answer, but in a better, cruft-free way.
>>
>> Bio::DB::Taxonomy::genbank anyone?
>
> Ach... You're compromising here;
No, I don't think so. Let me explain...
(another very long email, but with the same conclusion as above)
> 1) Switch out Bio::Species with Node or Taxonomy; relocate other
> information temporarily (Bio::Species, get/sets in Seq object,
> SimpleValue). Leave Bio::Species in for the time being, but don't
> bother making any additional changes to it.
[...]
> Hence Hilmar's suggestion to use a $seq->taxon() method to return a
> Node/Taxonomy, and a $seq->species() would still return a
> Bio::Species object. It's redundant,
As I see it, the problem to be solved is this:
a) A node should just be a node, holding only information about itself
(but this can include information on who its parent is, and methods
relating to getting its parents/children as new objects - but the data
of its parents/children must never be stored on itself).
b) Bio::Species isn't very good at its job; you can't ask reasonable
taxonomic questions of it and get correct answers.
c) We need to transition Bio::Species to something better - something
that lets us do the same job as Bio::Species, but do it better. An
important aspect of 'better' is that we can switch from the taxonomic
information in a genbank file or similar to the information in a
taxonomic database if we want certain taxonomic questions answered
correctly. But also, we should be able to answer all questions with a
good chance of a correct answer even without database access/installation.
There are a variety of possible solutions. How can we decide which is
best? What would a good solution be?
The 'something better' we transition Bio::Species to will become the
preferred (or at least de facto standard) way of dealing with taxonomic
information in bioperl. This taxonomic module (or set of modules) must
be able to model taxonomic information anywhere it is found - databases
or genbank files or anything else. If it can't, it would be
fundamentally flawed.
d) We can immediately discount any solution that involves storing some
taxonomic information outside of the tax module. If we find ourselves
putting lineage data in a genbank file in SimpleValue objects or
similar, we can be pretty sure we've used a poor solution to the
problem. That would be a compromise.
e) If the thing we transition Bio::Species to can't do everything
Bio::Species did (doing it in a different and better way is fine of
course), it's not suitable for transitioning to (this is why Node needed
all the cruft added to it before it was a suitable candidate). If it
/can/ do everything Bio::Species did, there would be no harm immediately
making Bio::Species inherit from the new tax module, reimplementing
Bio::Species as necessary but making no API change. So any solution that
would /require/ $seq->taxon() and $seq->species() wouldn't be a good
one, and would be a compromise. But we do want to get rid of
Bio::Species eventually, so I'm not saying we shouldn't have a
$seq->taxon() or similar, only that either method would give you the
same type of object with the same methods available
($seq->taxon->isa('tax module') && $seq->species->isa('Bio::Species')
&& $seq->species->isa('tax module')).
I see 2 possible solutions to the problem. What should 'tax module' be?:
1) Bio::Taxonomy or other similar class that is a container of multiple
nodes. Naively this makes logical sense since one of the jobs
Bio::Species has is to store a lineage, and a lineage is best
represented as a set of Nodes. So let's have a single object with all
our Nodes in it. Problems:
Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
requires that you know the ranks and order of ranks of all your input
nodes before you input them. It requires that all ranks have unique
names. It doesn't handle ranks of 'no rank'. You can't have more than
one lineage in an instance because you can't have two nodes with the
same rank. If you don't know the ranks of your nodes (e.g. genbank) there
is no way to maintain the order of your lineage because there is no
modelling of parent/child.
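The rank problem can be sketched in a few lines of plain Perl. This is not Bio::Taxonomy code; the lineage below is illustrative sample data, and the two hashes merely stand in for rank-centric versus parent/child storage.

```perl
use strict;
use warnings;

# Illustrative lineage (root first) in which two entries share the
# rank 'no rank'.
my @lineage = (
    [ 'cellular organisms',  'no rank' ],
    [ 'Eukaryota',           'superkingdom' ],
    [ 'Fungi/Metazoa group', 'no rank' ],
    [ 'Metazoa',             'kingdom' ],
);

# Rank-centric storage: one slot per rank name, so the second
# 'no rank' entry silently overwrites the first.
my %by_rank;
$by_rank{ $_->[1] } = $_->[0] for @lineage;

# Parent/child storage: each name records only its parent, so the
# order can be recovered whatever the ranks are.
my %parent_of;
for my $i (1 .. $#lineage) {
    $parent_of{ $lineage[$i][0] } = $lineage[ $i - 1 ][0];
}

# Walk from the last node up to the root.
my @path = ('Metazoa');
unshift @path, $parent_of{ $path[0] } while exists $parent_of{ $path[0] };

print scalar(keys %by_rank), " of ", scalar(@lineage),
      " names survive rank-keyed storage\n";
print join(' > ', @path), "\n";
```

Only three of the four names survive in %by_rank, while the parent links reproduce the full ordered lineage.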
I had planned to re-write it such that the rank-centric implementation
was removed and we had parent/child implementation instead. But then
there is nothing to stop you adding nodes that are disconnected from the
others, creating a broken mess.
Bio::Taxonomy::Tree might have been a little more suitable because it
implements Bio::Tree::TreeI, but sadly it is also rank-centric and
actually requires input of both Bio::Species and Bio::Taxonomy objects
to its most useful methods.
More important than issues with current implementations of
node-container classes, such classes are unable to let us solve problem
c) in a good way, and also leave us potentially storing in memory Node
objects representing the same taxonomic node multiple times in different
instances of the node-container. For problem c) if we were to switch
from genbank nodes to database the solution is to delete all the nodes
in the container and then get them all again from the database. What if
you didn't even have a lineage-related question? You've just retrieved
tens of nodes from the database for no reason (and then stored them), when
all you wanted was accurate information on the node you were interested in.
All in all, it's pretty horrible. Unsuitable implementations plus excess
database retrieval plus massive waste of memory with duplicated nodes
does not equal a good solution.
2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
methods binomial(), species(), genus(), sub_species(),
variant(), organelle(), classification() and show_all(). Except for
organelle() which doesn't belong in taxonomy, all of these Bio::Species
'questions' can still be answered by Node - just not in a single method
call. I outlined how to answer them in the previous post. For backward
compatibility make Bio::Species a Node and implement the suggested way
of answering the questions the proper 'Node' way under those methods.
Problems:
Well, those questions can't actually be answered by Node if the starting
point was genbank data or manually created Nodes. The solution is clean
and simple: Bio::DB::Taxonomy::genbank or perhaps better named
Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
ordered list of names - I don't see anything inherently wrong or ugly
with that). Then everything magically just works. We get all the power
to ask all our questions that Node has already when working with the
ncbi database, but we get it when working with genbank data. We suffer
none of the problems of a node-container class. We can easily switch
databases on the fly.
What's not to like?
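A rough sketch of the Node plan above, in plain Perl. My::ListDB and My::Node are hypothetical stand-ins (for the proposed Bio::DB::Taxonomy::list and for Bio::Taxonomy::Node respectively), and the second, longer list merely plays the role of a fuller database; none of this is existing Bioperl code.

```perl
use strict;
use warnings;

# Hypothetical in-memory "database" built from an ordered list of
# names (root first), as a genbank lineage line would provide.
package My::ListDB;

sub new {
    my ($class, @names) = @_;
    my %parent;
    $parent{ $names[$_] } = $names[ $_ - 1 ] for 1 .. $#names;
    return bless { parent => \%parent }, $class;
}

# Name of the parent, or undef at the root and for unknown names.
sub parent_name {
    my ($self, $name) = @_;
    return $self->{parent}{$name};
}

# Hypothetical node: stores only its own name plus a database handle.
package My::Node;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, db => $args{db_handle} }, $class;
}

sub name { $_[0]{name} }

# Get/set the database that answers questions about other nodes.
sub db_handle {
    my ($self, $db) = @_;
    $self->{db} = $db if defined $db;
    return $self->{db};
}

# The parent is fetched from the database as a new object each time;
# nothing about it is stored on this node.
sub parent {
    my ($self) = @_;
    my $parent_name = $self->db_handle->parent_name($self->name);
    return unless defined $parent_name;
    return My::Node->new(name => $parent_name, db_handle => $self->db_handle);
}

# Lineage from the root down to this node, derived by walking parents.
sub lineage {
    my ($self) = @_;
    my @names = ($self->name);
    my $node  = $self;
    while (my $p = $node->parent) {
        unshift @names, $p->name;
        $node = $p;
    }
    return @names;
}

package main;

my $file_db  = My::ListDB->new(qw(Eukaryota Metazoa Chordata Homo));
my $other_db = My::ListDB->new(qw(Eukaryota Metazoa Chordata Mammalia Primates Homo));

my $node = My::Node->new(name => 'Homo', db_handle => $file_db);
print join('; ', $node->lineage), "\n";

$node->db_handle($other_db);    # swap databases on the fly
print join('; ', $node->lineage), "\n";
```

The node itself never changes; only the handle does, so the same lineage question gets the file's answer or the fuller answer depending on which database is attached.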
From bix at sendu.me.uk Wed Jul 26 06:00:01 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 11:00:01 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
References: <000001c6b017$873176a0$15327e82@pyrimidine>
<9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
Message-ID: <44C73D21.3010301@sendu.me.uk>
Hilmar Lapp wrote:
> Instead, create something like
>
> # return a Bio::Taxonomy::Node:
> my $taxon = $seq->taxon();
Yes, but $seq->species() would also
> # alternative approach: return a lineage (taxonomy)
> # this would be Bio::TaxonomyI compliant
> my $lineage = $seq->lineage();
I've since come to the conclusion that anything Taxonomy-ish would be
inappropriate - see recent post.
> The former would require the lineage (and organelle for completeness)
> information to be either easily (though not necessarily directly)
> accessible through the node, or added as annotation.
That specifically is the main problem with Node as it is now. You
shouldn't store information about the lineage (essentially information
about other nodes) on the node object itself. Storing it as annotation
on the Node or elsewhere is terrible: you lose all the power of Node and
can no longer ask any lineage-related questions.
There is no need for this split in functionality - when you don't have
database access and just some genbank files, you can't answer any
taxonomic questions involving lineage, vs. when you do have database
access suddenly you can start doing useful things.
My proposed solution is that bioperl's taxonomy model always lets you
answer the same questions regardless of your source for taxonomic
information - see recent post.
From cjfields at uiuc.edu Wed Jul 26 08:16:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 26 Jul 2006 07:16:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C733A1.9070201@sendu.me.uk>
References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk>
<1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
<44C733A1.9070201@sendu.me.uk>
Message-ID: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu>
> ...
>
> I see 2 possible solutions to the problem. What should 'tax module'
> be?:
>
> 1) Bio::Taxonomy or other similar class that is a container of
> multiple
> nodes. Naively this makes logical sense since one of the jobs
> Bio::Species has is to store a lineage, and a lineage is best
> represented as a set of Nodes. So let's have a single object with all
> our Nodes in it. Problems:
>
> Bio::Taxonomy itself, as currently written, is fundamentally
> flawed. It
> requires that you know the ranks and order of ranks of all your input
> nodes before you input them. It requires that all ranks have unique
> names. It doesn't handle ranks of 'no rank'. You can't have more than
> one lineage in an instance because you can't have two nodes with the
> same rank. If you don't know the ranks of your nodes (ie. genbank)
> there
> is no way to maintain the order of your lineage because there is no
> modelling of parent/child.
> I had planned to re-write it such that the rank-centric implementation
> was removed and we had parent/child implementation instead. But then
> there is nothing to stop you adding nodes that are disconnected
> from the
> others, creating a broken mess.
>
> Bio::Taxonomy::Tree might have been a little more suitable because it
> implements Bio::Tree::TreeI, but sadly it is also rank-centric and
> actually requires input of both Bio::Species and Bio::Taxonomy objects
> to its most useful methods.
>
> More important than issues with current implementations of
> node-container classes, such classes are unable to let us solve
> problem
> c) in a good way, and also leave us potentially storing in memory Node
> objects representing the same taxonomic node multiple times in
> different
> instances of the node-container. For problem c) if we were to switch
> from genbank nodes to database the solution is to delete all the nodes
> in the container and then get them all again from the database.
> What if
> you didn't even have a lineage-related question? You've just retrieved
> 10s of nodes from the database for no reason (and then store them),
> when
> all you wanted was accurate information on the node you were
> interested in.
>
> All in all, it's pretty horrible. Unsuitable implementations plus
> excess
> database retrieval plus massive waste of memory with duplicated nodes
> does not equal a good solution.
>
>
> 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
> methods binomial(), species(), genus(), sub_species(),
> variant(), organelle(), classification() and show_all(). Except for
> organelle() which doesn't belong in taxonomy, all of these
> Bio::Species
> 'questions' can still be answered by Node - just not in a single
> method
> call. I outlined how to answer them in the previous post. For backward
> compatibility make Bio::Species a Node and implement the suggested way
> of answering the questions the proper 'Node' way under those methods.
> Problems:
>
> Well, those questions can't actually be answered by Node if the
> starting
> point was genbank data or manually created Nodes. The solution is
> clean
> and simple: Bio::DB::Taxonomy::genbank or perhaps better named
> Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
> ordered list of names - I don't see anything inherently wrong or ugly
> with that). Then everything magically just works. We get all the power
> to ask all our questions that Node has already when working with the
> ncbi database, but we get it when working with genbank data. We suffer
> none of the problems of a node-container class. We can easily switch
> databases on the fly.
That 'broken mess' (referring to Bio::Taxonomy) is up to the user.
You could make it more stringent (i.e. only allow connected nodes,
starting with a single initiating node then build from there), though
I don't think that's necessary as most people would probably use some
sort of factory to generate a taxonomy (a warning might be
appropriate). You would have to watch out for potential circular
structures. Have it do what you want. I believe the original intent
of Taxonomy was to allow building a full-fledged taxonomic structure,
so it should stay that way.
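For what it's worth, the more stringent variant could be sketched like this. My::StrictTaxonomy is a hypothetical container in plain Perl, not the current Bio::Taxonomy API; it accepts only connected nodes and refuses moves that would create a circular structure.

```perl
use strict;
use warnings;

# Hypothetical strict container: starts from a single root, and a node
# may only be added if its parent is already present.
package My::StrictTaxonomy;

sub new {
    my ($class, $root) = @_;
    return bless { parent => { $root => undef } }, $class;
}

sub add_node {
    my ($self, $name, $parent) = @_;
    die "'$name' is already present\n"
        if exists $self->{parent}{$name};
    die "parent '$parent' is not present\n"
        unless exists $self->{parent}{$parent};
    $self->{parent}{$name} = $parent;
    return $self;
}

# Move an existing node under a new parent, refusing any move that
# would make the node its own ancestor (a circular structure).
sub reparent {
    my ($self, $name, $new_parent) = @_;
    die "'$name' is not present\n"
        unless exists $self->{parent}{$name};
    die "parent '$new_parent' is not present\n"
        unless exists $self->{parent}{$new_parent};
    for (my $n = $new_parent; defined $n; $n = $self->{parent}{$n}) {
        die "moving '$name' under '$new_parent' would create a cycle\n"
            if $n eq $name;
    }
    $self->{parent}{$name} = $new_parent;
    return $self;
}

package main;

my $tax = My::StrictTaxonomy->new('Eukaryota');
$tax->add_node('Metazoa', 'Eukaryota')->add_node('Chordata', 'Metazoa');

for my $attempt (
    sub { $tax->add_node('Homo', 'Primates') },       # parent missing
    sub { $tax->reparent('Metazoa', 'Chordata') },    # would be circular
) {
    eval { $attempt->(); 1 } or print "rejected: $@";
}
```

A factory that builds the structure root-first would never trip either check, which is why a warning rather than a hard failure might be enough in practice.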
Sendu, you have to realize this is up to how you want to implement
it. We're giving you the freedom to do what you want to
Bio::Taxonomy. Of course, if we think you're off we'll reel you back
in, but you seem to be on the right track. Realize that the only
contentious issue here is that horrible lineage line in the GenBank
file. We should have a way to rebuild it as it was from the original
file (i.e. not rebuild it from scratch with DB lookups by default).
However, you should also have the option to rebuild it from lookups
(i.e. correctly), which you could do with a Taxonomy.
Note this Bio::Taxonomy method:
classify
Title : classify
Usage : @obj[][0-1] = taxonomy->classify($species);
Function: return a ranked classification
Returns : @obj of taxa and ranks as word pairs separated by "@"
Args : Bio::Species object
As Bio::Species will be deprecated, you can use that method in a
dual, sneaky way: 1) directly store the lineage information, 2)
return the real one (DB lookups) if needed (e.g. if some flag is
set). And, if a Bio::Species argument is used, do what the
docs state (catch it early on with an if block and return within
it). Bio::Species, as used within genbank.pm, doesn't use
Bio::Taxonomy in any way. I don't know if you even need to retain
its original purpose here; you might be able to get away with
changing the fundamental way this method works altogether. That's up
to you.
my 2c
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
From bix at sendu.me.uk Wed Jul 26 08:49:05 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 13:49:05 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu>
References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk>
<1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
<44C733A1.9070201@sendu.me.uk>
<0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu>
Message-ID: <44C764C1.9010804@sendu.me.uk>
Chris Fields wrote:
> We're giving you the freedom to do what you want to Bio::Taxonomy.
I don't want to do anything with Bio::Taxonomy any more. I've already
shown that it isn't suitable for the job. Regardless of how it is
implemented, the entire idea of a class that contains Nodes isn't
appropriate, for reasons already stated.
> Realize that the only contentious issue here is
> that horrible lineage line in the GenBank file. We should have a way to
> rebuild it as it was from the original file (i.e. not rebuild it from
> scratch with DB lookups by default). However, you should also have the
> option to rebuild it from lookups (i.e. correctly), which you could do
> with a Taxonomy.
And I've already shown how rebuilding with a Taxonomy is very far from
ideal, while switching db_handle on a Node would be perfect. Why are you
now advocating Taxonomy when there is no reason to?
> Note this Bio::Taxonomy method:
>
> classify
>
> Title : classify
> Usage : @obj[][0-1] = taxonomy->classify($species);
> Function: return a ranked classification
> Returns : @obj of taxa and ranks as word pairs separated by "@"
> Args : Bio::Species object
Note that all this method does is let you combine a list of rank names
with the classification array in a Bio::Species, spitting out some weird
data structure. It is only of interest to Bio::Taxonomy::Tree.
We're in the situation where we don't know the rank names corresponding
to the classification array in a Bio::Species generated by genbank et
al. So classify() is of zero value.
> As Bio::Species will be deprecated, you can use that method in a dual,
> sneaky way: 1) directly store the lineage information,
No. Lineage information must be in the form of Nodes or you can't answer
lineage-related taxonomic questions.
> 2) return the real one (DB lookups) if needed
Messy. Doing it with Node would be far superior.
Again, Node works all the time, while Taxonomy would work badly or not
at all some of the time. Rather than suggest ways of using Taxonomy,
tell me what is wrong with my current Node plan.
From cjfields at uiuc.edu Wed Jul 26 11:15:28 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 26 Jul 2006 10:15:28 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C764C1.9010804@sendu.me.uk>
Message-ID: <002801c6b0c6$59279fa0$15327e82@pyrimidine>
I advocate anything but Bio::Species that allows you the option to use
lookups for correct taxonomic information and not guesswork (current
Bio::Species). So, you could pretty much replace Species immediately with a
DB-aware container object with simple get/sets. As of now, that would be
Node or Taxonomy. I have done this already, just haven't committed it
yet. And, when I mentioned having freedom to do what you want with
Bio::Taxonomy, that includes all of it (including Node, Tree, etc). We just
want it to be reasonable and not 'duct tape' for the various Bio::Species
mistakes of the past.
I don't think the problem here is really that complicated (still, the only
thing is the lineage stuff in a sequence file, right?).
> > As Bio::Species will be deprecated, you can use that method in a dual,
> > sneaky way: 1) directly store the lineage information,
>
> No. Lineage information must be in the form of Nodes or you can't answer
> lineage-related taxonomic questions.
You must have a way to store the 'horrible lineage information' data, as is,
for those users who do not care about taxonomy and just want to convert seq
streams. You shouldn't burden the everyday user with something that is
pretty specialized, this being finding correct taxonomic information based
on DB lookups for a particular reason (screening sequences, as Hilmar
pointed out, was one possibility).
I don't care how, but store lineage information as it appears in the file
(scalar string) or in a simple data structure (array, maybe?) capable of
retaining the information in some way. There are many many ways of doing
this which I have previously pointed out; take your pick.
Hilmar, in a previous post, told me to take a step back and contemplate a
world w/o Bio::Species, where you would design a system capable of dealing
with sequence file taxonomic data in a way that allows you to get correct
tax information when needed via NCBI Taxonomy data, yet not sacrifice speed
if you're just interested in converting sequences via SeqIO. Would you
design a Bio::Species class, then? Would you attempt to spend time parsing
out species and genus information, when the correct data is sitting on the
NCBI server or in a local flatfile? No. You would retain the minimal data
necessary in an object for reading and writing data, but have the >option<
available to run a lookup. Therefore, Bio::Taxonomy::Node was born. A
little prematurely, yes. Probably needed to bake a bit more...
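As a rough illustration of "minimal data plus an optional lookup", here is a plain-Perl sketch. My::TaxInfo and its methods are hypothetical, and the lookup is just a placeholder code reference standing in for a database query; this is not how Bio::Taxonomy::Node is actually written.

```perl
use strict;
use warnings;

# Hypothetical holder for the taxonomy text of one sequence record:
# the lineage is kept exactly as read, and a lookup only runs if the
# caller supplied one.
package My::TaxInfo;

sub new {
    my ($class, %args) = @_;
    return bless {
        raw_lineage => $args{raw_lineage},
        lookup      => $args{lookup},       # optional code reference
    }, $class;
}

# What a writer needs: the original text, unchanged.
sub raw_lineage { $_[0]{raw_lineage} }

# Cheap default: split the stored string. With a lookup configured,
# defer to it instead.
sub classification {
    my ($self) = @_;
    return $self->{lookup}->($self) if $self->{lookup};
    (my $text = $self->{raw_lineage}) =~ s/\.\s*$//;
    return split /;\s*/, $text;
}

package main;

my $raw = 'Eukaryota; Metazoa; Chordata; Mammalia.';

# Plain conversion: no lookup, nothing beyond string handling.
my $plain = My::TaxInfo->new(raw_lineage => $raw);
print join('|', $plain->classification), "\n";
print $plain->raw_lineage eq $raw ? "round-trips unchanged\n" : "altered\n";

# Opt-in lookup: the placeholder below stands in for a real query.
my $checked = My::TaxInfo->new(
    raw_lineage => $raw,
    lookup      => sub { return ('looked', 'up', 'names') },
);
print join('|', $checked->classification), "\n";
```

Someone who only converts sequence streams pays for a string copy and nothing else; the lookup cost is incurred only by those who ask for it.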
Anyway, we must eventually sever our reliance on Bio::Species in order to
deprecate it, so the lineage information must be contained, as it appears in
the file, somewhere else.
And my point with the classify() Bio::Taxonomy method is not to use it as
is; you could sneak in your own data if needed. It was an example of a
possible way of containing the lineage data, but not meant to be an absolute
way. It's up to you how you want to implement it.
I think the classes that are currently in place are more than capable of
handling the job. Hence my statement before that you are trying to get too
many things going right out the starting gate. Start simply by replacing
Bio::Species, then worry about other issues. If you think that a
specialized class would work, fine, but IMHO I don't think it's absolutely
necessary. I had proposed such a class before (more like a
Bio::Species-like Tax object) but was shut down, and rightly so; it's
unnecessarily complicated and 'contaminates' Bio::Taxonomy with extra
unnecessary methods (classification(), genus(), and so on).
My last proposal was to eventually strip out the unreliable taxonomic
parsing in the various SeqIO modules and replace it with something simple,
which seemed to be a consensus among us all. This has to do with Hilmar's
post-apocalyptic vision of a Bio::Species-free world. That will eventually
happen, and Bioperl will eventually switch over completely to
Bio::Taxonomy::Whatever. And Bio::Species can join BPLite and other
deprecated modules in the BioPerl Boot Hill.
But, for now that can't happen. We all strive for the best information
possible. However, you can't sacrifice the needs of other users, the majority
of whom probably don't care squat about taxonomy, to your (our) own needs. As I
have repeatedly stated, simple is good. We can't just usurp the API for our
own wishes w/o warning, so the change has to be gradual and Bio::Species
must stick around for the time being. And we must make DB lookups optional
or the villagers will be storming the castle.
Listen, Sendu. If you can wait a couple of weeks for further discussion
then we can slog on with this. But right now I just don't have any more
time for this, sorry. You can have the last word and I'll respond when I
get back.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Wednesday, July 26, 2006 7:49 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Chris Fields wrote:
> > We're giving you the freedom to do what you want to Bio::Taxonomy.
>
> I don't want to do anything with Bio::Taxonomy any more. I've already
> shown that it isn't suitable for the job. Regardless of how it is
> implemented, the entire idea of a class that contains Nodes isn't
> appropriate, for reasons already stated.
>
>
> > Realize that the only contentious issue here is
> > that horrible lineage line in the GenBank file. We should have a way to
> > rebuild it as it was from the original file (i.e. not rebuild it from
> > scratch with DB lookups by default). However, you should also have the
> > option to rebuild it from lookups (i.e. correctly), which you could do
> > with a Taxonomy.
>
> And I've already shown how rebuilding with a Taxonomy is very far from
> ideal, while switching db_handle on a Node would be perfect. Why are you
> now advocating Taxonomy when there is no reason to?
>
>
> > Note this Bio::Taxonomy method:
> >
> > classify
> >
> > Title : classify
> > Usage : @obj[][0-1] = taxonomy->classify($species);
> > Function: return a ranked classification
> > Returns : @obj of taxa and ranks as word pairs separated by "@"
> > Args : Bio::Species object
>
> Note that all this method does is let you combine a list of rank names
> with the classification array in a Bio::Species, spitting out some weird
> data structure. It is only of interest to Bio::Taxonomy::Tree.
> We're in the situation where we don't know the rank names corresponding
> to the classification array in a Bio::Species generated by genbank et
> al. So classify() is of zero value.
>
>
> > As Bio::Species will be deprecated, you can use that method in a dual,
> > sneaky way: 1) directly store the lineage information,
>
> No. Lineage information must be in the form of Nodes or you can't answer
> lineage-related taxonomic questions.
>
>
> > 2) return the real one (DB lookups) if needed
>
> Messy. Doing it with Node would be far superior.
>
>
> Again, Node works all the time, while Taxonomy would work badly or not
> at all some of the time. Rather than suggest ways of using Taxonomy,
> tell me what is wrong with my current Node plan.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
From morissardj at gmail.com Wed Jul 26 10:59:54 2006
From: morissardj at gmail.com (Morissard jérome)
Date: Wed, 26 Jul 2006 14:59:54 +0000 (UTC)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
<44BEA9FB.1070009@utk.edu>
Message-ID:
Hi,
this may help you:
http://morissardjerome.free.fr/Data/files/matrices.zip
From hlapp at gmx.net Wed Jul 26 11:36:32 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 26 Jul 2006 11:36:32 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C73D21.3010301@sendu.me.uk>
References: <000001c6b017$873176a0$15327e82@pyrimidine>
<9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
<44C73D21.3010301@sendu.me.uk>
Message-ID:
On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>> Instead, create something like
>>
>> # return a Bio::Taxonomy::Node:
>> my $taxon = $seq->taxon();
>
> Yes, but $seq->species() would also
$seq->species() would return a Bio::Species object which may not be
more than a thin shell anymore around an implementation that
delegates almost everything to a lineage object (Bio::Taxonomy).
$seq->taxon() in contrast need not return such a backwards-compatible
construct.
>
>> # alternative approach: return a lineage (taxonomy)
>> # this would be Bio::TaxonomyI compliant
>> my $lineage = $seq->lineage();
>
> I've since come to the conclusion that anything Taxonomy-ish would be
> inappropriate - see recent post.
Not sure which one you mean, and please don't reference really long
emails, you're asking a lot of other people to organize your thoughts
for them.
At any rate, my point is that if you only name it appropriately you
can avoid misconceptions about what is being returned. The fact that
it's confusing to return a taxonomy from a method called species()
doesn't mean it's equally bad to return a lineage (which is a limited
taxonomy) from a method called lineage().
> [...]
>
> My proposed solution is that bioperl's taxonomy model always lets you
> answer the same questions regardless of your source for taxonomic
> information - see recent post.
See above ... And I'd rather see some code or API examples than
extensive elaborations.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From hlapp at gmx.net Wed Jul 26 11:38:50 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 26 Jul 2006 11:38:50 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C733A1.9070201@sendu.me.uk>
References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk>
<1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
<44C733A1.9070201@sendu.me.uk>
Message-ID: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net>
On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote:
> Chris Fields wrote:
>>
>>> It seems like the main problem with Node right now is that it has
>>> classification() and things like genus(). I propose pure Node method
>>> solutions to answer the questions classification() and genus() were
>>> implemented to answer, but in a better, cruft-free way.
>>>
>>> Bio::DB::Taxonomy::genbank anyone?
>>
>> Ach... You're compromising here;
>
> No, I don't think so. Let me explain...
> (another very long email, but with the same conclusion as above)
Sorry, can you summarize this in a few sentences? If you do want
feedback from me you really need to be more concise.
-hilmar
>
>
>> 1) Switch out Bio::Species with Node or Taxonomy; relocate other
>> information temporarily (Bio::Species, get/sets in Seq object,
>> SimpleValue). Leave Bio::Species in for the time being, but don't
>> bother making any additional changes to it.
> [...]
>> Hence Hilmar's suggestion to use a $seq->taxon() method to return a
>> Node/Taxonomy, and a $seq->species() would still return a
>> Bio::Species object. It's redundant,
>
> As I see it, the problem to be solved is this:
>
> a) A node should just be a node, holding only information about itself
> (but this can include information on who its parent is, and methods
> relating to getting its parents/children as new objects - but the data
> of its parents/children must never be stored on itself).
>
> b) Bio::Species isn't very good at its job; you can't ask reasonable
> taxonomic questions of it and get correct answers.
>
> c) We need to transition Bio::Species to something better - something
> that lets us do the same job as Bio::Species, but do it better. An
> important aspect of 'better' is that we can switch from the taxonomic
> information in a genbank file or similar to the information in a
> taxonomic database if we want certain taxonomic questions answered
> correctly. But also, we should be able to answer all questions with a
> good chance of a correct answer even without database access/
> installation.
>
> There are a variety of possible solutions. How can we decide which is
> best? What would a good solution be?
>
> The 'something better' we transition Bio::Species to will become the
> preferred (or at least de facto standard) way of dealing with taxonomic
> information in bioperl. This taxonomic module (or set of modules) must
> be able to model taxonomic information anywhere it is found - databases
> or genbank files or anything else. If it can't, it would be
> fundamentally flawed.
>
> d) We can immediately discount any solution that involves storing some
> taxonomic information outside of the tax module. If we find ourselves
> putting lineage data in a genbank file in SimpleValue objects or
> similar, we can be pretty sure we've used a poor solution to the
> problem. That would be a compromise.
>
> e) If the thing we transition Bio::Species to can't do everything
> Bio::Species did (doing it in a different and better way is fine of
> course), it's not suitable for transitioning to (this is why Node needed
> all the cruft added to it before it was a suitable candidate). If it
> /can/ do everything Bio::Species did, there would be no harm in
> immediately making Bio::Species inherit from the new tax module,
> reimplementing Bio::Species as necessary but making no API change. So
> any solution that would /require/ $seq->taxon() and $seq->species()
> wouldn't be a good one, and would be a compromise. But we do want to get
> rid of Bio::Species eventually, so I'm not saying we shouldn't have a
> $seq->taxon() or similar, only that either method would give you the
> same type of object with the same methods available
> ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species')
> && $seq->species->isa('tax module'))).
>
>
> I see 2 possible solutions to the problem. What should 'tax module' be?:
>
> 1) Bio::Taxonomy or other similar class that is a container of multiple
> nodes. Naively this makes logical sense since one of the jobs
> Bio::Species has is to store a lineage, and a lineage is best
> represented as a set of Nodes. So let's have a single object with all
> our Nodes in it. Problems:
>
> Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
> requires that you know the ranks and order of ranks of all your input
> nodes before you input them. It requires that all ranks have unique
> names. It doesn't handle ranks of 'no rank'. You can't have more than
> one lineage in an instance because you can't have two nodes with the
> same rank. If you don't know the ranks of your nodes (ie. genbank) there
> is no way to maintain the order of your lineage because there is no
> modelling of parent/child.
> I had planned to re-write it such that the rank-centric implementation
> was removed and we had parent/child implementation instead. But then
> there is nothing to stop you adding nodes that are disconnected from the
> others, creating a broken mess.
>
> Bio::Taxonomy::Tree might have been a little more suitable because it
> implements Bio::Tree::TreeI, but sadly it is also rank-centric and
> actually requires input of both Bio::Species and Bio::Taxonomy objects
> to its most useful methods.
>
> More important than issues with current implementations of
> node-container classes, such classes are unable to let us solve problem
> c) in a good way, and also leave us potentially storing in memory Node
> objects representing the same taxonomic node multiple times in different
> instances of the node-container. For problem c) if we were to switch
> from genbank nodes to database the solution is to delete all the nodes
> in the container and then get them all again from the database. What if
> you didn't even have a lineage-related question? You've just retrieved
> 10s of nodes from the database for no reason (and then store them), when
> all you wanted was accurate information on the node you were
> interested in.
>
> All in all, it's pretty horrible. Unsuitable implementations plus excess
> database retrieval plus massive waste of memory with duplicated nodes
> does not equal a good solution.
>
>
> 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
> methods binomial(), species(), genus(), sub_species(),
> variant(), organelle(), classification() and show_all(). Except for
> organelle() which doesn't belong in taxonomy, all of these Bio::Species
> 'questions' can still be answered by Node - just not in a single method
> call. I outlined how to answer them in the previous post. For backward
> compatibility make Bio::Species a Node and implement the suggested way
> of answering the questions the proper 'Node' way under those methods.
> Problems:
>
> Well, those questions can't actually be answered by Node if the starting
> point was genbank data or manually created Nodes. The solution is clean
> and simple: Bio::DB::Taxonomy::genbank or perhaps better named
> Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
> ordered list of names - I don't see anything inherently wrong or ugly
> with that). Then everything magically just works. We get all the power
> to ask all our questions that Node has already when working with the
> ncbi database, but we get it when working with genbank data. We suffer
> none of the problems of a node-container class. We can easily switch
> databases on the fly.
>
> What's not to like?
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From jay at jays.net Wed Jul 26 11:32:53 2006
From: jay at jays.net (Jay Hannah)
Date: Wed, 26 Jul 2006 08:32:53 -0700
Subject: [Bioperl-l] Anyone else at OSCON right now?
Message-ID: <44C78B25.80503@jays.net>
Any other BioPerl'ers here in Portland for OSCON?
I'd love to chat about your life w/ BioPerl.
I'm here until Saturday morning.
j
http://oscon.kwiki.org/index.cgi?JayHannah
From adamnkraut at gmail.com Wed Jul 26 10:32:42 2006
From: adamnkraut at gmail.com (Adam Kraut)
Date: Wed, 26 Jul 2006 10:32:42 -0400
Subject: [Bioperl-l] Structure::IO
In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>
References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>
Message-ID: <134ede0b0607260732u79f0dea2if8f4ea98a5e03524@mail.gmail.com>
Hi bernd,
Can you better explain what it is you want to do with pdb files? From your
example it looks like you want to do something with each chain, but it is
unclear what you want to do here:
my @chains = $struc->chain($chain);
With that said, I was never able to use Bio::Structure in the way that I
wanted. I now use the MMTSB Perl libraries instead:
http://mmtsb.scripps.edu/cgi-bin/tooldoc?perlpackages
Specifically the Molecule module may be useful here.
Regards,
Adam
On 7/25/06, Bernd Web wrote:
>
> Hi,
>
> Does someone have experience with Bio::Structure::IO?
> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the
> chain() method of Bio::Structure::Entry doing? The POD states:
>
> Title : chain
> Usage : @chains = $structure->chain($chain);
> Function: Connects a (or a list of) Chain objects to a
> Bio::Structure::Entry.
> Returns : list of Bio::Structure::Residue objects
> Args : One Residue or a reference to an array of Residue objects
>
> But in e.g
> my $stream = Bio::Structure::IO->new(-file   => $filename,
>                                      -format => 'pdb');
> while ( my $struc = $stream->next_structure() ) {
>     for my $chain ($struc->get_chains) {
>         my $chainid = $chain->id;
>         my @chains  = $struc->chain($chain);
>     }
> }
>
> I get Bio::Structure::Chain=HASH(0x9f1ab50).
>
> What is the function of the chain method and how to use it?
>
> Best regards,
> bernd
--
Adam N. Kraut
National Resource for Biomedical Supercomputing
http://www.nrbsc.org/sb/
From bix at sendu.me.uk Wed Jul 26 12:11:25 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 17:11:25 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <002801c6b0c6$59279fa0$15327e82@pyrimidine>
References: <002801c6b0c6$59279fa0$15327e82@pyrimidine>
Message-ID: <44C7942D.6050603@sendu.me.uk>
Chris Fields wrote:
>> No. Lineage information must be in the form of Nodes or you can't answer
>> lineage-related taxonomic questions.
>
> You must have a way to store the 'horrible lineage information' data, as is,
> for those users who do not care about taxonomy and just want to convert seq
> streams. You shouldn't burden the everyday user with something that is
> pretty specialized, this being finding correct taxonomic information based
> on DB lookups for a particular reason (screening sequences, as Hilmar
> pointed out, was one possibility).
I am certainly not requiring that anyone find 'correct taxonomic
information'. The whole reason I am backing my current proposal is that
it works equally well with or without access to NCBI's taxonomy
database. Your proposals work poorly without access to such.
> I don't care how, but store lineage information as it appears in the file
> (scalar string) or in a simple data structure (array, maybe?) capable of
> retaining the information in some way. There are many many ways of doing
> this which I have previously pointed out; take your pick.
I've taken my pick.
To set:
my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@lineage);
$node->db_handle($db);
To get:
@lineage = map { $_->scientific_name } $node->get_Lineage_Nodes;
That is as simple as it is going to get in a world where we have 'pure'
Nodes or any other kind of pure taxonomic class.
If you want to hide the taxonomic complexity from end-users who want to
make and store their own lineage of their species without having to know
the details of how bioperl's taxonomy modules are supposed to work, tell
them to use Bio::Species:
To set:
$species->classification(@lineage);
To get:
@lineage = $species->classification;
Of course in this example I propose that behind the scenes Bio::Species
is a Bio::Taxonomy::Node and just implements classification() the pure
Node way, given above.
Let me make my requirement very clear: the solution must allow you to
find the most recent common ancestor of two solution-objects without
access to the NCBI taxonomy database, using exactly the same method call
you would use if you /did/ have access to the NCBI taxonomy database.
The method in question shouldn't need any special-case code depending on
the presence or absence of NCBI taxonomy database.
That's the litmus test. I'll tend to reject any solution that fails.
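As a rough illustration of that litmus test, here is a minimal sketch in
plain Perl. It assumes lineages are simple lists ordered species-first, as
in the examples in this thread. lca_from_lineages is a made-up helper, not
a BioPerl method, and matching purely by name is an assumption that would
break if the same name appeared in two unrelated lineages.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl. Each lineage is an array ref
# ordered from the species up to the root.
sub lca_from_lineages {
    my ($lineage_a, $lineage_b) = @_;
    my %in_b = map { $_ => 1 } @$lineage_b;

    # Walk upward from the species; the first shared name is the LCA.
    for my $name (@$lineage_a) {
        return $name if $in_b{$name};
    }
    return;    # no shared ancestor in the supplied lists
}

my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse = ('Mus musculus', 'Mus',  'Mammalia', 'Eukaryota');

my $lca = lca_from_lineages(\@human, \@mouse);
print "LCA: $lca\n";    # prints "LCA: Mammalia"
```

The call is the same whether the two lists were read from a flat file or
fetched from a database, which is the property being asked for.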
From bix at sendu.me.uk Wed Jul 26 12:25:41 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 17:25:41 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net>
References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk>
<1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>
<44C733A1.9070201@sendu.me.uk>
<3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net>
Message-ID: <44C79785.6050705@sendu.me.uk>
Hilmar Lapp wrote:
>
> On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote:
>
>>>> It seems like the main problem with Node right now is that it has
>>>> classification() and things like genus(). I propose pure Node method
>>>> solutions to answer the questions classification() and genus() were
>>>> implemented to answer, but in a better, cruft-free way.
>>>>
>>>> Bio::DB::Taxonomy::genbank anyone?
>
> Sorry, can you summarize this in a few sentences? If you do want
> feedback from me you really need to be more concise.
A bad solution-module stores any kind of taxonomic information outside
of the solution-module or in an inconsistent form. By 'inconsistent' I
mean, sometimes you store the name of a taxonomic rank with
$node->node_name, other times you store it in an array or scalar held
directly on the solution-module or elsewhere.
Bio::Taxonomy specifically is not usable. Generally speaking, classes
that are containers of multiple nodes are also inappropriate, because
they result in excess database retrieval and excess storage of
duplicated information amongst instances of such classes.
Bio::Taxonomy::Node combined with Bio::DB::Taxonomy::list would probably
be ideal.
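To make the Node-plus-list-database idea concrete, the following toy sketch
shows nodes that hold only their own name plus a database handle, with
ancestors looked up on demand. ToyListDB, ToyNode and all of their methods
are hypothetical stand-ins written for this illustration; they are not the
Bio::DB::Taxonomy or Bio::Taxonomy::Node API.

```perl
use strict;
use warnings;

# Toy stand-in for a list-backed taxonomy database (hypothetical).
package ToyListDB;

sub new { return bless { parent_of => {} }, shift }

# Lineage is ordered species-first, root-last; record child => parent.
sub add_lineage {
    my ($self, @names) = @_;
    for my $i (0 .. $#names - 1) {
        $self->{parent_of}{ $names[$i] } = $names[ $i + 1 ];
    }
    # Register the root as a known name with no parent.
    if (@names && !exists $self->{parent_of}{ $names[-1] }) {
        $self->{parent_of}{ $names[-1] } = undef;
    }
    return $self;
}

sub parent_name {
    my ($self, $name) = @_;
    return $self->{parent_of}{$name};
}

sub get_node {
    my ($self, $name) = @_;
    return unless exists $self->{parent_of}{$name};
    return ToyNode->new($name, $self);
}

# Toy stand-in for a 'pure' node: it stores no ancestor data itself.
package ToyNode;

sub new {
    my ($class, $name, $db) = @_;
    return bless { name => $name, db => $db }, $class;
}

sub name { return $_[0]{name} }

sub db_handle {
    my $self = shift;
    $self->{db} = shift if @_;
    return $self->{db};
}

# Ancestors are fetched from whatever database is currently attached.
sub lineage_names {
    my ($self) = @_;
    my @names = ($self->name);
    while (defined(my $parent = $self->db_handle->parent_name($names[-1]))) {
        push @names, $parent;
    }
    return @names;
}

sub lca_name {
    my ($self, $other) = @_;
    my %seen = map { $_ => 1 } $other->lineage_names;
    for my $name ($self->lineage_names) {
        return $name if $seen{$name};
    }
    return;
}

package main;

my $db = ToyListDB->new;
$db->add_lineage('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
$db->add_lineage('Mus musculus', 'Mus',  'Mammalia', 'Eukaryota');

my $h = $db->get_node('Homo sapiens');
my $m = $db->get_node('Mus musculus');
print join(', ', $h->lineage_names), "\n";
print 'LCA: ', $h->lca_name($m), "\n";    # prints "LCA: Mammalia"

# Switching to a richer source only means swapping the handle.
my $richer = ToyListDB->new;
$richer->add_lineage('Homo sapiens', 'Homo', 'Hominidae', 'Mammalia', 'Eukaryota');
$h->db_handle($richer);
print join(', ', $h->lineage_names), "\n";
```

No node is duplicated and nothing is fetched until a lineage question is
actually asked, which is the contrast with a node-container class.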
From cjfields at uiuc.edu Wed Jul 26 12:49:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 26 Jul 2006 11:49:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
Message-ID: <000001c6b0d3$7d936ec0$15327e82@pyrimidine>
Hilmar, apologies ahead of time for not being too concise! It's my last
hurrah on this thread. No, really!
...
> > Yes, but $seq->species() would also
>
> $seq->species() would return a Bio::Species object which may not be
> more than a thin shell anymore around an implementation that
> delegates almost everything to a lineage object (Bio::Taxonomy).
>
> $seq->taxon() in contrast need not return such a backwards-compatible
> construct.
In genbank.pm _read_GenBank_Species (initial implementation, to switch out
Bio::Species with Taxonomy/Node object):
1) Assign data to both Bio::Species (as currently implemented) and
Bio::Taxonomy::Node (new way).
2) Assign organelle to Bio::Species and the Seq object get/set organelle().
3) Assign lineage information to Bio::Species and as an array to the Seq
object get/set lineage().
Replace the get/set above with your method of choice, just no Bio::Species.
In genbank.pm write_seq()
1) if DB_lookup flag is defined, use $seq->taxon() to build lineage
2) If not, use $seq->lineage().
The fine details (how do you build the lineage?!?) can be worked out along
the way. The wonders of CVS!
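The two-step write_seq() plan above might look roughly like this. It is
only a sketch under assumptions: the sequence is a plain hash rather than a
real Seq object, lineage_for_output and the db_lookup option are invented
names, and the 'database' is a code ref returning a canned list.

```perl
use strict;
use warnings;

# Hypothetical sketch: pick the lineage source at write time.
# $seq->{lineage}      holds the names exactly as read from the file.
# $seq->{taxon_lookup} is a code ref standing in for a DB-backed taxon.
sub lineage_for_output {
    my ($seq, %opt) = @_;

    # Only consult the database when the caller explicitly asks for it.
    if ($opt{db_lookup} && $seq->{taxon_lookup}) {
        return $seq->{taxon_lookup}->();
    }

    # Default: round-trip whatever was in the original record.
    return @{ $seq->{lineage} || [] };
}

my $seq = {
    lineage      => [ 'Eukaryota', 'Mammalia', 'Homo' ],
    taxon_lookup => sub { return ('Eukaryota', 'Metazoa', 'Mammalia', 'Homo') },
};

print join('; ', lineage_for_output($seq)), "\n";
print join('; ', lineage_for_output($seq, db_lookup => 1)), "\n";
```

Users who only convert sequence streams never trigger a lookup; the
database path stays strictly opt-in.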
The Taxonomy class used here could be returned using Hilmar's $seq->taxon()
and Bio::Species can be returned via $seq->species(). Makes perfect sense!
Separated! Nothing complicated about it. Nice and clean. And Bio::Species
can eventually be shown the exit door. Elvis has left the building...
Organelle-specific sequence TaxIDs, as they refer to the organism and not
the organelle, could be placed elsewhere, preferably somewhere more
accessible such as $seq->organelle(). And lineage, similarly, could be
placed in $seq->lineage(), which would store it as a raw string or as an
array. There are many other ways I had pointed out (SimpleValue, Node,
etc); I don't care, as long as we eventually sever the Bio::Species tumor
from SeqIO.
...
> ...And I'd rather see some code or API examples than
> extensive elaborations.
>
> -hilmar
Hilmar's right; working code does speak louder than words. The energy
spent in writing up full expositions is better spent elsewhere, hence: I
need to get back to work! Wish I could contribute more.
Chris
From bix at sendu.me.uk Wed Jul 26 13:13:43 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 18:13:43 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk>
Message-ID: <44C7A2C7.2070100@sendu.me.uk>
Hilmar Lapp wrote:
> On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote:
>
>> Hilmar Lapp wrote:
>>> Instead, create something like
>>>
>>> # return a Bio::Taxonomy::Node:
>>> my $taxon = $seq->taxon();
>> Yes, but $seq->species() would also
>
> $seq->species() would return a Bio::Species object which may not be
> more than a thin shell anymore around an implementation that
> delegates almost everything to a lineage object (Bio::Taxonomy).
I actually forgot to finish that sentence. I was going to suggest
Bio::Species isa Bio::Taxonomy::Node and would indeed delegate most of
its implementation to Node.
>>> # alternative approach: return a lineage (taxonomy)
>>> # this would be Bio::TaxonomyI compliant
>>> my $lineage = $seq->lineage();
>> I've since come to the conclusion that anything Taxonomy-ish would be
>> inappropriate - see recent post.
>
> The fact that it's confusing to return a taxonomy from a method called species()
> doesn't mean it's equally bad to return a lineage (which is a limited
> taxonomy) from a method called lineage().
You wouldn't need to though. If you want a lineage you could ask your
node for its lineage. There's no point in having a whole other class
that contains a node and all its ancestor nodes, when to get the
ancestors of a node all you have to do is $node->get_Lineage_Nodes().
>> My proposed solution is that bioperl's taxonomy model always lets you
>> answer the same questions regardless of your source for taxonomic
>> information - see recent post.
>
> See above ... And I'd rather see some code or API examples
The fine details of the following may be slightly off, but it's just to
provide an example. I'll use Test.pm syntax.
my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse = ('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');
Old way with Node
-----------------
my $h_node = new Bio::Taxonomy::Node(-classification => \@human);
my $m_node = new Bio::Taxonomy::Node(-classification => \@mouse);
@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok @human, 0; # failure to work as expected
@human = $h_node->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
my $lca = $h_node->get_LCA_Node($m_node);
ok $lca, undef; # failure to do anything useful because our lineage data
# is in an array, not in nodes
# try again with entrez - must make brand new objects
my $db = new Bio::DB::Taxonomy(-source => 'entrez');
$h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');
@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group,
Hominidae, ..."; # now it works!
$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # and now this works!
Old way with Bio::Species
-------------------------
# forget about it, Species has nothing like a get_LCA_Node()
Proposed way with Node
----------------------
my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@human);
my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$db->add_lineage(@mouse); # or make a new db
my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');
@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
# works as expected
my $lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # works first time
# try again with entrez - just change the db_handle
$h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));
@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group,
Hominidae, ...";
$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia';
Proposed way with Bio::Species
------------------------------
# (Bio::Species isa Bio::Taxonomy::Node, implements its methods like
# above)
my $h_species = new Bio::Species(-classification => \@human);
my $m_species = new Bio::Species(-classification => \@mouse);
@human = map { $_->scientific_name } $h_species->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
@human = $h_species->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
my $lca = $h_species->get_LCA_Node($m_species);
ok $lca->scientific_name, 'Mammalia';
# trying again with entrez behaves as per proposed Node, above
From angshu96 at gmail.com Wed Jul 26 13:15:35 2006
From: angshu96 at gmail.com (Angshu Kar)
Date: Wed, 26 Jul 2006 12:15:35 -0500
Subject: [Bioperl-l] WUBLASTP parsing problem
Message-ID:
Hi,
Does WU-BLASTP have any restriction on the length of the
sequence names (or on the sequence names themselves)?
What is happening is this: I use fasta-format proteins to build the
blast report (I do a distributed blastp). But when I parse the
report (using bioperl), the query column remains empty for some
results, as in:
           328857  6.6e-135
           325331  6.3e-114
           325329  1.0e-113
           325332  1.7e-113
           325330  2.7e-113
           ...
while for some it's perfect, as:
267750     280003  7.5e-301
267750     348279  7.5e-301
267750     345867  2.0e-300
267750     251915  2.0e-300
267750     346539  6.7e-300
...
Some of my sequences look like this:
IMGA|AC159872_38.1 hypothetical protein AC159872.12 35121-35051 H
EGN_Mt050401 20060209 TIGR 1671.m00013
mrsciilhnmivederdtyaqrwtefeqpggngsstpqpystelrdpdvhhklqtdlvkh
iwikfgmyrd*
And part of the blastp result (the one where I'm facing the issue)
looks like this:
                                                                    Smallest
                                                                      Sum
                                                             High   Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N
gi|33333045|gb|AAQ11687.1| MADS box protein [Triticum aes...  1318  6.6e-135  1
gi|47681327|gb|AAT37484.1| MADS5 protein [Dendrocalamus l...  1120  6.3e-114  1
gi|47681331|gb|AAT37486.1| MADS7 protein [Dendrocalamus l...  1118  1.0e-113  1
gi|47681325|gb|AAT37483.1| MADS4 protein [Dendrocalamus l...  1116  1.7e-113  1
gi|47681329|gb|AAT37485.1| MADS6 protein [Dendrocalamus l...  1114  2.7e-113  1
gi|47681323|gb|AAT37482.1| MADS3 protein [Dendrocalamus l...  1114  2.7e-113  1
11674.m04224|LOC_Os08g41950|protein K-box region, putative     976  1.1e-98   1
gi|28630961|gb|AAO45877.1| MADS5 [Lolium perenne]              967  1.0e-97   1
gi|44888605|gb|AAS48129.1| AGAMOUS LIKE9-like protein [Ho...   964  2.1e-97   1
11674.m04223|LOC_Os08g41950|protein K-box region, putative     899  1.6e-90   1
gi|34979580|gb|AAQ83834.1| MADS box protein [Asparagus of...   875  5.8e-88   1
Could you please let me know if I'm missing something? Does the gi have
anything to do with this?
Thanking you,
Angshu
From cain.cshl at gmail.com Wed Jul 26 12:19:26 2006
From: cain.cshl at gmail.com (Scott Cain)
Date: Wed, 26 Jul 2006 12:19:26 -0400
Subject: [Bioperl-l] Installing staden io_lib on windows?
Message-ID: <1153930767.2632.5.camel@localhost.localdomain>
Hi all,
I'm wondering if anyone has tried to install Staden's io_lib on Windows,
and if so, how did it go? I am not much of a Windows person, but I've
tried to make it under cygwin only to get this message:
make all-recursive
make[1]: Entering directory `/home/scott/io_lib-1.9.2'
Making all in read
make[2]: Entering directory `/home/scott/io_lib-1.9.2/read'
if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../include -I../read -I../alf
-I../abi -I../ctf -I../ztr -I../plain -I../scf -I../sff -I../exp_file
-I../utils -I/usr/local/include -g -O2 -MT Read.o -MD -MP -MF
".deps/Read.Tpo" -c -o Read.o Read.c; \
then mv -f ".deps/Read.Tpo" ".deps/Read.Po"; else rm -f
".deps/Read.Tpo"; exit 1; fi
In file included from Read.h:43,
from Read.c:40:
../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or
SP_LITTLE_ENDIAN in Makefile
make[2]: *** [Read.o] Error 1
make[2]: Leaving directory `/home/scott/io_lib-1.9.2/read'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/scott/io_lib-1.9.2'
make: *** [all] Error 2
I'm guessing there is a flag I can pass to the configure script to get
the endian-ness right, but I don't know (and I don't know if this is
just the beginning of a long, fruitless road :-)
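Not an answer to the configure question, but a quick way to see which of
the two names in that #error applies to the machine at hand. This is a
small sketch using only core Perl; it reports the byte order of the host
perl, and how the resulting define should be handed to the io_lib build is
not something it settles.

```perl
use strict;
use warnings;
use Config;

# pack 'S' writes a native unsigned short; on a little-endian host the
# low-order byte (value 1) is stored first.
my $first_byte = unpack('C', pack('S', 1));
my $define = $first_byte == 1 ? 'SP_LITTLE_ENDIAN' : 'SP_BIG_ENDIAN';

# %Config records the byte order perl itself was built with.
print "perl byteorder: $Config{byteorder}\n";
print "matching define: $define\n";
```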
I would like to use Bio::SCF (from CPAN) in conjunction with the trace
glyph in BioGraphics to view traces in GBrowse.
Thanks for any advice,
Scott
--
------------------------------------------------------------------------
Scott Cain, Ph. D. cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
From morissardj at gmail.com Wed Jul 26 16:49:58 2006
From: morissardj at gmail.com (leverdeterre)
Date: Wed, 26 Jul 2006 13:49:58 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To:
References: <44BEA9FB.1070009@utk.edu>
Message-ID: <5510746.post@talk.nabble.com>
I'm happy to help.
I have made a page which may interest you:
http://morissardjerome.free.fr/Data/index.html
There is more information about the 397-matrix file (in the first 3 lines),
and I have made all the logo files.
++
--
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
Sent from the Perl - Bioperl-L forum at Nabble.com.
From morissardj at gmail.com Wed Jul 26 17:15:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Wed, 26 Jul 2006 14:15:19 -0700 (PDT)
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To:
References:
Message-ID: <5511136.post@talk.nabble.com>
And without Bioperl, I think this may help you:
http://morissardjerome.free.fr/perl/blastparser.html
--
View this message in context: http://www.nabble.com/Blast-Output-Parsing-tf1974691.html#a5511136
Sent from the Perl - Bioperl-L forum at Nabble.com.
From osborne1 at optonline.net Wed Jul 26 17:00:50 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Wed, 26 Jul 2006 17:00:50 -0400
Subject: [Bioperl-l] SeqUtils
In-Reply-To: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>
Message-ID:
Bernd,
That's easily done, changed both POD and code.
Brian O.
On 7/25/06 7:44 AM, "Bernd Web" wrote:
> Hi,
>
> With Bio::SeqUtils it may be nice to support 3 letter codes with
> capitals only, too.
> Now
>
> my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');
>
> will give in $string->seq: XXX.
>
> Possibly the capitals in MetGlyTer are used to find the amino acids codes?
> If not maybe it's easy to implement case-insensitive, or all-capitals
> for AA codes in SeqUtils?
>
> In addition, about the POD: maybe it's better not to use $string since
> Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq
> object.
>
> Regards,
> Bernd
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
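One way the all-capitals case could be handled is to normalize the string to the mixed-case form before it reaches seq3in. The sketch below is plain Perl and makes no claim about what was actually committed to Bio::SeqUtils; the helper name normalize_three_letter is hypothetical, and it assumes the input is a clean run of three-letter codes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqUtils.
# Splits the input into 3-character chunks and rewrites each chunk
# with an initial capital, so 'METGLYTER' and 'metglyter' both
# become 'MetGlyTer'.
sub normalize_three_letter {
    my ($codes) = @_;
    return join '', map { ucfirst(lc($_)) } unpack('(A3)*', $codes);
}

print normalize_three_letter('METGLYTER'), "\n";    # prints MetGlyTer
```

The normalized string could then be passed on in place of the original argument.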
From osborne1 at optonline.net Wed Jul 26 17:24:34 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Wed, 26 Jul 2006 17:24:34 -0400
Subject: [Bioperl-l] Structure::IO
In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>
Message-ID:
Bernd,
I'm not following your question. The POD in the latest Bio::Structure::Entry
shows:
=head2 chain()
Title : chain
Usage : @chains = $structure->chain($chain);
Function: Connects a Chain or a list of Chain objects to a
Bio::Structure::Entry.
Returns : List of Bio::Structure::Chain objects
Args : A Chain or a reference to an array of Chain objects
=cut
Which is not what you've copied and pasted. What version of Bioperl do you
use?
Brian O.
On 7/25/06 6:47 AM, "Bernd Web" wrote:
> Hi,
>
> Does someone have experience with Bio::Structure::IO?
> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the
> chain() method of Bio::Structure::Entry doing? The POD states:
>
> Title : chain
> Usage : @chains = $structure->chain($chain);
> Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
> Returns : list of Bio::Structure::Residue objects
> Args : One Residue or a reference to an array of Residue objects
>
> But in e.g
> my $stream = Bio::Structure::IO->new(-file => $filename,
> -format => 'pdb');
> while ( my $struc = $stream->next_structure() ) {
> for my $chain ($struc->get_chains) {
> my $chainid = $chain->id;
> my @chains = $struc->chain($chain);
> }
> }
>
> I get Bio::Structure::Chain=HASH(0x9f1ab50).
>
> What is the function of the chain method and how to use it?
>
> Best regards,
> bernd
From hlapp at gmx.net Thu Jul 27 01:06:52 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 27 Jul 2006 01:06:52 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C7A2C7.2070100@sendu.me.uk>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk>
<44C7A2C7.2070100@sendu.me.uk>
Message-ID: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
I think this looks like a great solution. You could also name
Bio::DB::Taxonomy::list as Bio::DB::Taxonomy::inmemory because it
really isn't much else than an in-memory database (of limited content
if you populate it from flat-file sequence annotation).
The only reservation I have is that you'd have methods on Node that
don't really operate on the node instance but rather operate on the
taxonomy (database) behind the scenes. That's what I would have used
Bio::Taxonomy for, not so much as a container than as a class with
(conceptually) 'static' methods corresponding to those that are now
in Node, like get_Lineage_Nodes(). They would optionally accept a
db_handle too, or use a default one set as an attribute.
However, leaving/having these methods on Node really isn't such a big
deal and I'm sure would even be preferred by many people for the sake
of simplicity.
So overall I think you should just go ahead.
-hilmar
On Jul 26, 2006, at 1:13 PM, Sendu Bala wrote:
>
> The fine details of the following may be slightly off, but it's
> just to
> provide an example. I'll use Test.pm syntax.
>
> my @human = ('Homo sapiens', qw(Homo Mammalia Eukaryota));
> my @mouse = ('Mus musculus', qw(Mus Mammalia Eukaryota));
>
>
> [...]
> Proposed way with Node
> ----------------------
>
> my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => @human);
> my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
> $db->add_lineage(@mouse); # or make a new db
> my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');
>
> @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
> ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
> # works as expected
>
> my $lca = $h_node->get_LCA_Node($m_node);
> ok $lca->scientific_name, 'Mammalia'; # works first time
>
> # try again with entrez - just change the db_handle
> $h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));
>
> @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
> ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group,
> Hominidae, ...";
>
> $lca = $h_node->get_LCA_Node($m_node);
> ok $lca->scientific_name, 'Mammalia';
>
> [...]
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From bix at sendu.me.uk Thu Jul 27 03:07:22 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 08:07:22 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk>
<86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
Message-ID: <44C8662A.3080904@sendu.me.uk>
Hilmar Lapp wrote:
> The only reservation I have is that you'd have methods on Node that
> don't really operate on the node instance but rather operate on the
> taxonomy (database) behind the scenes. That's what I would have used
> Bio::Taxonomy for, not so much as a container than as a class with
> (conceptually) 'static' methods corresponding to those that are now
> in Node, like get_Lineage_Nodes().
Yes, I had the same reservation. But it somehow seemed reasonable for me
to ask a node for its lineage, though I draw the line at having a method
like get_node('rank_name'). That's the only thing Bio::Taxonomy would
have been good for, so it's a trade off between some nice methods and
the problems inherent in a node-container class.
Though, perhaps we almost have the best of both worlds, since the
database is effectively a container without the problems:
$node->db_handle->get_Taxonomy_Node(-rank => 'rank_name',
-lineage_of => $node); ?
> So overall I think you should just go ahead.
Great, will do.
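The trade-off discussed here, a node that answers lineage questions by delegating to whatever database it currently points at, can be sketched in plain Perl. The package names Sketch::TaxDB and Sketch::Node and their methods are hypothetical stand-ins, not the Bio::DB::Taxonomy or Bio::Taxonomy::Node interfaces.

```perl
use strict;
use warnings;

package Sketch::TaxDB;    # hypothetical in-memory taxonomy

sub new {
    my ($class) = @_;
    my $self = { parent => {} };
    return bless $self, $class;
}

# Record a leaf-first lineage: each name's parent is the next name.
sub add_lineage {
    my ($self, @names) = @_;
    for my $i (0 .. $#names - 1) {
        $self->{parent}{ $names[$i] } = $names[ $i + 1 ];
    }
    return $self;
}

# Walk the parent links from a name up to the root.
sub lineage_of {
    my ($self, $name) = @_;
    my @lineage;
    while (defined $name) {
        push @lineage, $name;
        $name = $self->{parent}{$name};
    }
    return @lineage;
}

package Sketch::Node;    # hypothetical node holding a database handle

sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, db => $args{db} };
    return bless $self, $class;
}

# Getter/setter, so the backing database can be swapped later.
sub db_handle {
    my ($self, $db) = @_;
    $self->{db} = $db if defined $db;
    return $self->{db};
}

# The node only knows its own name; the lineage comes from the database.
sub lineage {
    my ($self) = @_;
    return $self->db_handle->lineage_of($self->{name});
}

package main;

my $small = Sketch::TaxDB->new->add_lineage('Homo sapiens', 'Homo', 'Mammalia');
my $large = Sketch::TaxDB->new->add_lineage('Homo sapiens', 'Homo', 'Hominidae', 'Mammalia');

my $node = Sketch::Node->new(name => 'Homo sapiens', db => $small);
print join(', ', $node->lineage), "\n";

$node->db_handle($large);    # same node, richer database
print join(', ', $node->lineage), "\n";
```

Because the node holds no lineage of its own, changing the handle changes the answer, which is the behaviour the entrez example in the thread relies on.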
From maximilianh at gmail.com Thu Jul 27 04:56:44 2006
From: maximilianh at gmail.com (Maximilian Haeussler)
Date: Thu, 27 Jul 2006 10:56:44 +0200
Subject: [Bioperl-l] TRANSFAC matrices, open acces
Message-ID: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com>
Actually, the fact that the TRANSFAC matrices belong to a
company is quite inconvenient for biologists and bioinformatics
analysts working in this field.
There are some projects to annotate cis-sequences in regular intervals
by volunteers and put the data into the public domain, one of them is
the oreganno database http://www.oreganno.org/. Its first annotation
jamboree will be held in Gent at the end of this year.
If you're interested in cis-sequences, want to meet others who are,
and are willing to contribute some annotation effort, don't hesitate
to come to Gent. It's conveniently placed in the middle of Europe and
registration costs almost nothing.
http://www.dmbr.ugent.be/bioit/contents/regcreative/
One day, hopefully, journals will oblige authors to put their
sequences in a common format into genbank but as long as regulation is
not seen as an important part of genome annotation, a lot of manual
annotation will have to be done.
cheers
max
> On 26/07/06, leverdeterre wrote:
> >
> > i'm happy for helping you
> > i'have done a page whitch can interrest you
> > http://morissardjerome.free.fr/Data/index.html
> >
> > there is more information about the 397 matrix file ( in the 3 first line)
> > and i'm done all the logo file .
> >
> > ++
> > --
> > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
> > Sent from the Perl - Bioperl-L forum at Nabble.com.
> >
> >
>
--
Maximilian Haeussler,
CNRS/INRA Gif-sur-Yvette, France
tel: +33 6 12 82 76 16
skype: maximilianhaeussler
From morissardj at gmail.com Thu Jul 27 05:10:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Thu, 27 Jul 2006 02:10:19 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <5510746.post@talk.nabble.com>
References: <44BEA9FB.1070009@utk.edu>
<5510746.post@talk.nabble.com>
Message-ID: <5517747.post@talk.nabble.com>
Sorry, I removed all this data because it is the property of TRANSFAC:
http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html
The TRANSFAC® database is free for users from non-profit organizations only.
Users from commercial enterprises have to license the TRANSFAC® database and
accompanying programs.
--
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5517747
Sent from the Perl - Bioperl-L forum at Nabble.com.
From maximilianh at gmail.com Thu Jul 27 04:44:47 2006
From: maximilianh at gmail.com (Maximilian Haeussler)
Date: Thu, 27 Jul 2006 10:44:47 +0200
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <5510746.post@talk.nabble.com>
References: <44BEA9FB.1070009@utk.edu>
<5510746.post@talk.nabble.com>
Message-ID: <76f031ae0607270144of6ff9cbtbd9f3045bbc4e6e1@mail.gmail.com>
I'm pretty sure that you are not allowed to distribute these matrices:
http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html
[well...but if you don't care and biobase doesn't complain...
actually anyone can scrape the matrices from the website with wget.]
max
On 26/07/06, leverdeterre wrote:
>
> i'm happy for helping you
> i'have done a page whitch can interrest you
> http://morissardjerome.free.fr/Data/index.html
>
> there is more information about the 397 matrix file ( in the 3 first line)
> and i'm done all the logo file .
>
> ++
> --
> View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
> Sent from the Perl - Bioperl-L forum at Nabble.com.
>
>
From bix at sendu.me.uk Thu Jul 27 05:55:01 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 10:55:01 +0100
Subject: [Bioperl-l] TRANSFAC matrices, open acces
In-Reply-To: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com>
References: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com>
Message-ID: <44C88D75.7040102@sendu.me.uk>
Maximilian Haeussler wrote:
> Actually, the fact that the transfac matrices are belonging to a
> company is quite inconvenient for biologists and bioinformatics
> analyses working in this field.
The public version is adequate though. It would certainly be useful to
have Bioperl access to transfac and other regulation databases. I'll
probably write some suitable modules if no one beats me to it.
From sdavis2 at mail.nih.gov Thu Jul 27 07:43:09 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 27 Jul 2006 07:43:09 -0400
Subject: [Bioperl-l] TRANSFAC matrices, open acces
In-Reply-To: <44C88D75.7040102@sendu.me.uk>
Message-ID:
On 7/27/06 5:55 AM, "Sendu Bala" wrote:
> Maximilian Haeussler wrote:
>> Actually, the fact that the transfac matrices are belonging to a
>> company is quite inconvenient for biologists and bioinformatics
>> analyses working in this field.
>
> The public version is adequate though. It would certainly be useful to
> have Bioperl access to transfac and other regulation databases. I'll
> probably write some suitable modules if no one beats me to it.
I haven't used it in a while, but the TFBS family of modules are, if I
recall correctly, bioperl-compatible, though not part of bioperl. In any
case, for those who aren't aware, it might be worth a quick look:
http://forkhead.cgb.ki.se/TFBS/
Sean
From bix at sendu.me.uk Thu Jul 27 08:01:03 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 13:01:03 +0100
Subject: [Bioperl-l] TRANSFAC matrices, open acces
In-Reply-To:
References:
Message-ID: <44C8AAFF.6060100@sendu.me.uk>
Sean Davis wrote:
>
> On 7/27/06 5:55 AM, "Sendu Bala" wrote:
>
>> Maximilian Haeussler wrote:
>>> Actually, the fact that the transfac matrices are belonging to a
>>> company is quite inconvenient for biologists and bioinformatics
>>> analyses working in this field.
>
>> The public version is adequate though. It would certainly be useful to
>> have Bioperl access to transfac and other regulation databases. I'll
>> probably write some suitable modules if no one beats me to it.
>
> I haven't used it in a while, but the TFBS family of modules are, if I
> recall correctly, bioperl-compatible, though not part of bioperl. In any
> case, for those who aren't aware, it might be worth a quick look:
Yes. It only has online access to Transfac though, and the inheritance
and returned objects are TFBS specific so you miss out on whatever
goodness there may be in the rest of bioperl.
Still, recommended to use if you want programmatic access to Transfac
matrices right now.
From bernd.web at gmail.com Thu Jul 27 06:14:13 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Thu, 27 Jul 2006 12:14:13 +0200
Subject: [Bioperl-l] Structure::IO
In-Reply-To:
References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>
Message-ID: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com>
Hi
Thanks for your notes. The text I pasted comes from
http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm
(v1.25 2006/07/04) shows a different POD.
I am trying to get annotation out of PDB. ID is not a problem, but I
would like to have the HEADER and possibly comment fields to a (FastA)
description line, but how?
Bio::Structure::Entry v.1.25 does not list the annotation method in
the POD anymore (due to a missing empty line before =head).
$struc->annotation still exists; I can get the keys but not the values
with $struc->annotation($struc->seqres) (Can't locate object method
"get_Annotations" via package "Bio::PrimarySeq").
(Example script attached).
The POD states: annotation: $obj->annotation($seq_obj). So I thought
of a PrimarySeq object to pass to annotation.
The PrimarySeq object ($struc->seqres) does not contain a description:
$struc->seqres->desc is uninitialized.
Is it possible to get annotation from header/comments fields with
Bio::Structure?
Best regards,
Bernd
On 7/26/06, Brian Osborne wrote:
> Bernd,
>
> I'm not following your question. The POD in the latest Bio::Structure::Entry
> shows:
>
> =head2 chain()
>
> Title : chain
> Usage : @chains = $structure->chain($chain);
> Function: Connects a Chain or a list of Chain objects to a
> Bio::Structure::Entry.
> Returns : List of Bio::Structure::Chain objects
> Args : A Chain or a reference to an array of Chain objects
>
> =cut
>
> Which is not what you've copied and pasted. What version of Bioperl do you
> use?
>
> Brian O.
>
>
>
> On 7/25/06 6:47 AM, "Bernd Web" wrote:
>
> > Hi,
> >
> > Does someone have experience with Bio::Structure::IO?
> > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the
> > chain() method of Bio::Structure::Entry doing? The POD states:
> >
> > Title : chain
> > Usage : @chains = $structure->chain($chain);
> > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
> > Returns : list of Bio::Structure::Residue objects
> > Args : One Residue or a reference to an array of Residue objects
> >
> > But in e.g
> > my $stream = Bio::Structure::IO->new(-file => $filename,
> > -format => 'pdb');
> > while ( my $struc = $stream->next_structure() ) {
> > for my $chain ($struc->get_chains) {
> > my $chainid = $chain->id;
> > my @chains = $struc->chain($chain);
> > }
> > }
> >
> > I get Bio::Structure::Chain=HASH(0x9f1ab50).
> >
> > What is the function of the chain method and how to use it?
> >
> > Best regards,
> > bernd
>
>
>
-------------- next part --------------
#!/usr/bin/perl -w
use warnings;
use strict;
use Bio::Structure::IO;
my $filename = $ARGV[0];
my $stream = Bio::Structure::IO->new( -file => $filename,
-format => 'pdb');
while ( my $struc = $stream->next_structure() ) {
print "SEQRES DESC: ", $struc->seqres->desc, "\n";
print join(" ", keys %{$struc->annotation($struc->seqres)}), "\n";
print join(" ", keys %{$struc->annotation()}), "\n";
print join(" ", values %{$struc->annotation()}), "\n"; #(partly) works
print join(" ", values %{$struc->annotation($struc->seqres)}), "\n"; #does not work
}
From bix at sendu.me.uk Thu Jul 27 09:31:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 14:31:54 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk>
<86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
Message-ID: <44C8C04A.8070504@sendu.me.uk>
Hilmar Lapp wrote:
>
> So overall I think you should just go ahead.
One last suggestion for discussion:
It may be appropriate to rename Bio::Taxonomy::Node to clarify that
Node has no particular reliance on or association with Bio::Taxonomy or
the other modules in Bio/Taxonomy/.
How about calling it Bio::Taxon?
It is more obvious what to expect from something called 'Bio::Taxon'
when you know that it is the new 'Bio::Species': like Bio::Species but
for any taxon. It also makes the class 'top-level' which I think most
people are happier using; seems like things in sub-directories are more
for advanced users.
From hlapp at gmx.net Thu Jul 27 09:44:25 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 27 Jul 2006 09:44:25 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8C04A.8070504@sendu.me.uk>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk>
<86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
<44C8C04A.8070504@sendu.me.uk>
Message-ID: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>
I don't think the top-level or sub-directory matters at all and I
don't want anybody to get used to the notion that that may imply
anything (except possibly better thought-out structure for the sub-
directory level). For instance RichSeq is what all rich annotation
sequence format parsers return, yet it is in a sub-directory.
I don't have any real objection to Bio::Taxon though if that's what you'd
like to name it - although, what will happen to the Bio::Taxonomy
hierarchy then? Phased out?
-hilmar
On Jul 27, 2006, at 9:31 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>>
>> So overall I think you should just go ahead.
>
> One last suggestion for discussion:
>
> It may be appropriate to rename Bio::Taxonomy::Node to clarify that
> Node has no particular reliance on or association with
> Bio::Taxonomy or
> the other modules in Bio/Taxonomy/.
>
> How about calling it Bio::Taxon?
>
> It is more obvious what to expect from something called 'Bio::Taxon'
> when you know that it is the new 'Bio::Species': like Bio::Species but
> for any taxon. It also makes the class 'top-level' which I think most
> people are happier using; seems like things in sub-directories are
> more
> for advanced users.
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Thu Jul 27 09:48:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 27 Jul 2006 08:48:32 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8662A.3080904@sendu.me.uk>
Message-ID: <002a01c6b183$59779880$15327e82@pyrimidine>
Sounds good to me; agree with Hilmar's suggestion of 'in_memory' as well,
but it's your choice.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Thursday, July 27, 2006 2:07 AM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Hilmar Lapp wrote:
> > The only reservation I have is that you'd have methods on Node that
> > don't really operate on the node instance but rather operate on the
> > taxonomy (database) behind the scenes. That's what I would have used
> > Bio::Taxonomy for, not so much as a container than as a class with
> > (conceptually) 'static' methods corresponding to those that are now
> > in Node, like get_Lineage_Nodes().
>
> Yes, I had the same reservation. But it somehow seemed reasonable for me
> to ask a node for its lineage, though I draw the line at having a method
> like get_node('rank_name'). That's the only thing Bio::Taxonomy would
> have been good for, so it's a trade off between some nice methods and
> the problems inherent in a node-container class.
>
> Though, perhaps we almost have the best of both worlds, since the
> database is effectively a container without the problems:
> $node->db_handle->get_Taxonomy_Node(-rank => 'rank_name',
> -lineage_of => $node); ?
>
>
> > So overall I think you should just go ahead.
>
> Great, will do.
>
From osborne1 at optonline.net Thu Jul 27 09:44:33 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 27 Jul 2006 09:44:33 -0400
Subject: [Bioperl-l] Structure::IO
In-Reply-To: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com>
Message-ID:
Bernd,
I'll need to take a closer look at the POD but from your description
it seems it's wrong, or certainly incomplete. To get the HEADER line you'll
do something like:
my $stream = Bio::Structure::IO->new(-file => $filename,
-format => 'pdb');
my $struc = $stream->next_structure();
my $anncoll = $struc->annotation;
my @headers = $anncoll->get_Annotations('header');
This implies that all these top-level annotations are associated with the
entry, not with the chains. I don't use Bio::Structure so don't assume this
is true for all annotations.
There are 2 ways to explore this further. One is to look at t/StructIO.t or
other tests, useful examples are frequently found in the tests. The other is
to run your script in the debugger:
>perl -d pdb.pl 1CAM.pdb
By examining the variables your script creates using the "x" command you get
to see exactly where strings are stored and what the names of the keys are,
this is how I found the HEADER line. Type "h" for the debugger's Help.
Brian O.
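For the stated goal of building a FastA description line from the HEADER record, a plain-Perl fallback is to read the record's fixed columns directly. This is a different technique from the Bio::Structure annotation call above, offered only as a sketch: it assumes the standard PDB column layout (classification in columns 11-50, deposition date in 51-59, ID code in 63-66), and the helper names and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a PDB HEADER record into its fixed-width
# fields. Returns undef for any line that is not a HEADER record.
sub parse_pdb_header {
    my ($line) = @_;
    return unless defined $line && $line =~ /^HEADER/;
    $line = sprintf '%-80s', $line;    # pad short lines to full width

    my $classification = substr($line, 10, 40);
    my $date           = substr($line, 50, 9);
    my $id             = substr($line, 62, 4);
    s/^\s+|\s+$//g for ($classification, $date, $id);

    return {
        classification => $classification,
        date           => $date,
        id             => $id,
    };
}

# Hypothetical helper: turn the parsed fields into a FastA header line.
sub fasta_header {
    my ($h) = @_;
    return ">$h->{id} $h->{classification} ($h->{date})";
}

# Made-up record, laid out in the assumed columns.
my $record = sprintf '%-10s%-40s%-9s   %-4s',
    'HEADER', 'EXAMPLE CLASS', '01-JAN-00', '1ABC';
print fasta_header(parse_pdb_header($record)), "\n";
```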
On 7/27/06 6:14 AM, "Bernd Web" wrote:
> Hi
>
> Thanks for your notes. The text I pasted comes from
> http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm
> (v1.25 2006/07/04) shows a different POD.
>
> I am trying to get annotation out of PDB. ID is not a problem, but I
> would like to have the HEADER and possibly comment fields to a (FastA)
> description line, but how?
>
> Bio::Structure::Entry v.1.25 does not list the annotation method in
> the POD anymore (due to a missing empty line before =head).
> $struc->annotation still exists; I can get the keys but not the values
> with $struc->annotation($struc->seqres) (Can't locate object method
> "get_Annotations" via package "Bio::PrimarySeq").
> (Example script attached).
>
> The POD states: annotation: $obj->annotation($seq_obj). So I thought
> of a PrimarySeq object to pass to annotation.
>
> The PrimarySeq object ($struc->seqres) does not contain a description:
> $struc->seqres->desc is uninitialized.
>
> Is it possible to get annotation from header/comments fields with
> Bio::Structure?
>
> Best regards,
> Bernd
>
>
> On 7/26/06, Brian Osborne wrote:
>> Bernd,
>>
>> I'm not following your question. The POD in the latest Bio::Structure::Entry
>> shows:
>>
>> =head2 chain()
>>
>> Title : chain
>> Usage : @chains = $structure->chain($chain);
>> Function: Connects a Chain or a list of Chain objects to a
>> Bio::Structure::Entry.
>> Returns : List of Bio::Structure::Chain objects
>> Args : A Chain or a reference to an array of Chain objects
>>
>> =cut
>>
>> Which is not what you've copied and pasted. What version of Bioperl do you
>> use?
>>
>> Brian O.
>>
>>
>>
>> On 7/25/06 6:47 AM, "Bernd Web" wrote:
>>
>>> Hi,
>>>
>>> Does someone have experience with Bio::Structure::IO?
>>> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the
>>> chain() method of Bio::Structure::Entry doing? The POD states:
>>>
>>> Title : chain
>>> Usage : @chains = $structure->chain($chain);
>>> Function: Connects a (or a list of) Chain objects to a
>>> Bio::Structure::Entry.
>>> Returns : list of Bio::Structure::Residue objects
>>> Args : One Residue or a reference to an array of Residue objects
>>>
>>> But in e.g
>>> my $stream = Bio::Structure::IO->new(-file => $filename,
>>> -format => 'pdb');
>>> while ( my $struc = $stream->next_structure() ) {
>>> for my $chain ($struc->get_chains) {
>>> my $chainid = $chain->id;
>>> my @chains = $struc->chain($chain);
>>> }
>>> }
>>>
>>> I get Bio::Structure::Chain=HASH(0x9f1ab50).
>>>
>>> What is the function of the chain method and how to use it?
>>>
>>> Best regards,
>>> bernd
>>
>>
>>
From aaron.j.mackey at gsk.com Thu Jul 27 08:54:05 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Thu, 27 Jul 2006 08:54:05 -0400
Subject: [Bioperl-l] Installing staden io_lib on windows?
In-Reply-To: <1153930767.2632.5.camel@localhost.localdomain>
Message-ID:
Hi Scott,
> In file included from Read.h:43,
> from Read.c:40:
> ../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or
> SP_LITTLE_ENDIAN in Makefile
os.h has a bunch of #ifdef statements that check for platforms, and there
isn't one for cygwin (but there is for MinGW)
Try running configure with "CFLAGS=-DSP_LITTLE_ENDIAN" as an argument, or somesuch
Also take a look at the MinGW section of os.h to see if there are others
you will likely need (e.g. NOPIPE, NOLOCKF, etc)
Alternatively, you may want to just edit the original os.h to duplicate
the MinGW section with the appropriate compiler constant for CYGWIN
(__CYGWIN__ I'm guessing, but don't really know for sure).
Good luck,
-Aaron
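If it is unclear which of the two macros applies to a given machine, the byte order can be checked from Perl before running configure. A minimal sketch, assuming only core Perl; the helper name endian_define is hypothetical, and the macro names are the ones from the os.h error message.

```perl
use strict;
use warnings;

# Hypothetical helper: return the io_lib macro name that matches this
# machine's byte order.
sub endian_define {
    # pack 'S' writes a native-order unsigned short; on a little-endian
    # machine the low byte (value 1) is stored first.
    my $first_byte = unpack('C', pack('S', 1));
    return $first_byte == 1 ? 'SP_LITTLE_ENDIAN' : 'SP_BIG_ENDIAN';
}

print 'CFLAGS=-D', endian_define(), "\n";
```

The printed string is the variable assignment that would be handed to configure.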
From bix at sendu.me.uk Thu Jul 27 10:06:23 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 15:06:23 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk>
<86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
<44C8C04A.8070504@sendu.me.uk>
<077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>
Message-ID: <44C8C85F.2010104@sendu.me.uk>
Hilmar Lapp wrote:
> I don't think the top-level or sub-directory matters at all and I don't
> want anybody to get used to the notion that that may imply anything
> (except possibly better thought-out structure for the sub-directory
> level). For instance RichSeq is what all rich annotation sequence format
> parsers return, yet it is in a sub-directory.
Well, I'm not aware that I've ever used a RichSeq ;). But your point is
taken.
> I don't have any real objection to Bio::Taxon though if that's what you'd
> like to name it - although, what will happen to the Bio::Taxonomy
> hierarchy then? Phased out?
At the moment it seems to me that the Bio::Taxonomy modules (excluding
Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which
tests Taxon and Tree:
## I am pretty sure this module is going the way of the dodo bird so
## I am not sure how much work to put into fixing the tests/module
FactoryI is strange (it isn't intended to work like any other Bioperl
factory) and there are no implementers of it, while Taxonomy.pm itself
would be redundant after my Node changes and has implementation issues,
though it may make more sense to some people.
My vote is phase out.
What is the actual process involved in renaming a module in Bioperl?
From hlapp at gmx.net Thu Jul 27 10:29:09 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 27 Jul 2006 10:29:09 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8C85F.2010104@sendu.me.uk>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk>
<86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
<44C8C04A.8070504@sendu.me.uk>
<077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>
<44C8C85F.2010104@sendu.me.uk>
Message-ID:
How do you mean 'process'? You create a new module, and then you
deprecate the ones you're phasing out. If possible you rewrite the
implementation to use the new module.
Not sure this answers your question?
-hilmar
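The create-then-deprecate approach described above can be sketched with two small packages: the new class holds the implementation, and the old name survives as a thin subclass that warns when used. The Sketch::* package names are hypothetical, and the plain warn() stands in for whatever deprecation mechanism the project actually uses.

```perl
use strict;
use warnings;

package Sketch::Taxon;    # the new module holds the implementation

sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{-name} };
    return bless $self, $class;
}

sub name { return $_[0]{name} }

package Sketch::Taxonomy::Node;    # the old name, kept as a thin wrapper

our @ISA = ('Sketch::Taxon');

# Warn, then hand construction over to the new class.
sub new {
    my ($class, @args) = @_;
    warn "Sketch::Taxonomy::Node is deprecated; use Sketch::Taxon instead\n";
    return $class->SUPER::new(@args);
}

package main;

my $old = Sketch::Taxonomy::Node->new(-name => 'Homo sapiens');
print $old->name, "\n";
```

Existing callers keep working through the old name while the warning nudges them toward the new one.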
On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>> I don't think the top-level or sub-directory matters at all and I
>> don't
>> want anybody to get used to the notion that that may imply anything
>> (except possibly better thought-out structure for the sub-directory
>> level). For instance RichSeq is what all rich annotation sequence
>> format
>> parsers return, yet it is in a sub-directory.
>
> Well, I'm not aware that I've ever used a RichSeq ;). But your
> point is
> taken.
>
>
>> I don't have any real objection to Bio::Taxon though if that's what you'd
>> like to name it - although, what will happen to the Bio::Taxonomy
>> hierarchy then? Phased out?
>
> At the moment it seems to me that the Bio::Taxonomy modules (excluding
> Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t
> which
> tests Taxon and Tree:
>
> ## I am pretty sure this module is going the way of the dodo bird so
> ## I am not sure how much work to put into fixing the tests/module
>
> FactoryI is strange (it isn't intended to work like any other Bioperl
> factory) and there are no implementers of it, while Taxonomy.pm itself
> would be redundant after my Node changes and has implementation
> issues,
> though it may make more sense to some people.
>
> My vote is phase out.
>
>
> What is the actual process involved in renaming a module in Bioperl?
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From cjfields at uiuc.edu Thu Jul 27 10:29:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 27 Jul 2006 09:29:39 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>
Message-ID: <003101c6b189$17f5d2e0$15327e82@pyrimidine>
I'll respond to both here:
> Sendu Bala wrote:
>
> One last suggestion for discussion:
>
> It may be appropriate to rename Bio::Taxonomy::Node to clarify that
> Node has no particular reliance on or association with Bio::Taxonomy or
> the other modules in Bio/Taxonomy/.
>
> How about calling it Bio::Taxon?
>
> It is more obvious what to expect from something called 'Bio::Taxon'
> when you know that it is the new 'Bio::Species': like Bio::Species but
> for any taxon. It also makes the class 'top-level' which I think most
> people are happier using; seems like things in sub-directories are more
> for advanced users.
Hilmar explains the namespace issue with Bioperl more concisely below.
You should still be able to use a Node in a Taxonomy, but then again you
should also be able to use a Taxon in a Taxonomy as well (by definition, a
Taxon is part of a Taxonomy as it is a taxonomic unit). The whole "looking
at this from a biologist's perspective" thing again...
http://en.wikipedia.org/wiki/Taxon
BTW, what exactly is Bio::Taxonomy::Taxon used for? It looks like it is
used more for building taxonomic trees than anything, so shouldn't it be
moved to Bio::Tree::Taxon (that name isn't used)? Then you could use
Bio::Taxonomy::Taxon for your purposes.
See, the only concern I have with using the name Bio::Taxon is people
confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though
I agree that the name makes sense for what you want.
> Hilmar Lapp wrote:
>
> I don't think the top-level or sub-directory matters at all and I
> don't want anybody to get used to the notion that that may imply
> anything (except possibly better thought-out structure for the sub-
> directory level). For instance RichSeq is what all rich annotation
> sequence format parsers return, yet it is in a sub-directory.
>
> I don't have any real objection to Bio::Taxon though if that's what you'd
> like to name it - although, what will happen to the Bio::Taxonomy
> hierarchy then? Phased out?
>
> -hilmar
I'm not sure how many people out there use Bio::Taxonomy. I think they use
the tree-building modules in Bio::Tree more than anything. And there
haven't been any panicked users protesting at the gates yet about the many
posts for Bio::Taxonomy changes (well, except me, and 'I got better').
Chris
From cjfields at uiuc.edu Thu Jul 27 10:54:06 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 27 Jul 2006 09:54:06 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8C85F.2010104@sendu.me.uk>
Message-ID: <003201c6b18c$829330e0$15327e82@pyrimidine>
> > I don't have any real objection to Bio::Taxon though if that's what you'd
> > like to name it - although, what will happen to the Bio::Taxonomy
> > hierarchy then? Phased out?
>
> At the moment it seems to me that the Bio::Taxonomy modules (excluding
> Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which
> tests Taxon and Tree:
>
> ## I am pretty sure this module is going the way of the dodo bird so
> ## I am not sure how much work to put into fixing the tests/module
>
> FactoryI is strange (it isn't intended to work like any other Bioperl
> factory) and there are no implementers of it, while Taxonomy.pm itself
> would be redundant after my Node changes and has implementation issues,
> though it may make more sense to some people.
>
> My vote is phase out.
>
>
> What is the actual process involved in renaming a module in Bioperl?
This is how many times the phrase "Bio::Taxonomy" is used in Bioperl in
directory Bio (which should catch any namespace usage like Node, etc.):
Instances: 2 BP Module : Bio::DB::Taxonomy
Instances: 4 BP Module : Bio::DB::Taxonomy::entrez
Instances: 7 BP Module : Bio::DB::Taxonomy::flatfile
Instances: 1 BP Module : Bio::Expression::Platform
Instances: 1 BP Module : Bio::SeqIO::genbank
Instances: 22 BP Module : Bio::Taxonomy
Instances: 8 BP Module : Bio::Taxonomy::FactoryI
Instances: 17 BP Module : Bio::Taxonomy::Node
Instances: 20 BP Module : Bio::Taxonomy::Taxon
Instances: 39 BP Module : Bio::Taxonomy::Tree
Hmm, not much. Almost all hits are within Bio::DB::Taxonomy or
Bio::Taxonomy. The SeqIO::genbank hit was my change BTW; I just haven't
tossed the code yet.
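For anyone who wants to repeat the count, here is a minimal sketch of one
way to produce a tally like the one above. It is not necessarily how the
numbers in this post were generated; the count_phrase function and the
module-name derivation from file paths are assumptions made for
illustration. It uses only core modules (File::Find, File::Spec,
File::Basename).

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Basename qw(dirname);

# Count literal occurrences of $phrase in every .pm file under $dir.
# Returns a hash ref mapping a module name (derived from the file path
# relative to the parent of $dir) to the number of occurrences.
sub count_phrase {
    my ($dir, $phrase) = @_;
    my $parent = dirname($dir);
    my %count;
    find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path && $path =~ /\.pm$/;
            open my $fh, '<', $path or return;
            my $n = 0;
            while (my $line = <$fh>) {
                # Count every match on the line, not just the first one.
                $n++ while $line =~ /\Q$phrase\E/g;
            }
            close $fh;
            return unless $n;
            # e.g. Bio/Taxonomy/Node.pm becomes Bio::Taxonomy::Node
            my $module = File::Spec->abs2rel($path, $parent);
            $module =~ s/\.pm$//;
            $module =~ s{[/\\]}{::}g;
            $count{$module} = $n;
        },
    }, $dir);
    return \%count;
}

# Usage: perl count_phrase.pl Bio
my $dir = shift @ARGV;
if (defined $dir && -d $dir) {
    my $count = count_phrase($dir, 'Bio::Taxonomy');
    printf "Instances: %d BP Module : %s\n", $count->{$_}, $_
        for sort keys %$count;
}
```

Because the match is a plain substring match, a hit on Bio::Taxonomy also
counts longer names such as Bio::Taxonomy::Node, which is the behaviour
described above.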
Therefore, the only one left that would be affected (outside of
Bio::Taxonomy and Bio::DB::Taxonomy) is Allen Day's
Bio::Expression::Platform class, which uses Bio::DB::Taxonomy::entrez to
grab Nodes; that would just be changed over to whatever class you plan on
using. And that class hasn't been documented at all outside the methods.
Furthermore, judging by the mail list archives, the Bio::Taxonomy modules
have had very little usage outside of Node. Jason mentioned in an old post
that he could never get Bio::Taxonomy::Taxon/Tree to work and that Dan
Kortschak had moved on (Dan's last post was in 2003). Hence the test file
comments.
And you make a good point with Bio::Taxonomy::FactoryI.
I agree, if the modules haven't served a useful purpose they should be
phased out.
Chris
From cjfields at uiuc.edu Thu Jul 27 11:15:25 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 27 Jul 2006 10:15:25 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
Message-ID: <003301c6b18f$7d114000$15327e82@pyrimidine>
Wow, we're doing a little bioperl spring cleaning here!
I agree with Hilmar: create a new module (Bio::Taxon), which claims the
namespace, and deprecate the old ones.
How 'broken', exactly, is Bio::Taxonomy? The idea behind it seems sound
(a container for Nodes), so maybe it should just be reconfigured and all
the other classes in directory Bio/Taxonomy deprecated. Or should we start
from scratch completely?
I don't know if it has been attempted, but it would be nice to have a way
of building taxonomic trees from Node/Taxon information using a
Taxonomy-like container object. I like the way NCBI does something along
these lines with BLAST output now.
BTW, thanks guys for a rousing discussion!
Chris
From hlapp at gmx.net Thu Jul 27 11:29:04 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 27 Jul 2006 11:29:04 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine>
References: <003101c6b189$17f5d2e0$15327e82@pyrimidine>
Message-ID:
On Jul 27, 2006, at 10:29 AM, Chris Fields wrote:
> See, the only concern I have with using the name Bio::Taxon is people
> confusing it with Bio::Taxonomy itself or with
> Bio::Taxonomy::Taxon. Though
> I agree that the name makes sense for what you want.
I don't think Bio::Taxonomy is used a lot in earnest, if at all, so you
could even test the waters by deprecating the modules right away: put
warning statements there and see whether anybody complains about the
cluttered terminal screens. If this goes into snapshot releases and
release candidates leading up to 1.6, then they may be phased out
right away.
Unless anybody on the list has strong objections? Anybody using
Bio::Taxonomy?
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From skirov at utk.edu Thu Jul 27 09:57:19 2006
From: skirov at utk.edu (skirov)
Date: Thu, 27 Jul 2006 09:57:19 -0400
Subject: [Bioperl-l] TRANSFAC matrices, open acces
Message-ID: <44E2E794@webmail.utk.edu>
Just a quick note: Bio::Matrix::PSM::IO can parse matrix.dat if you can get
it, and as far as I can tell this is not easy. You have to contact the
company to get access, and it is not clear what their conditions are. This
is the reason I have decided not to maintain the transfac parser.
Stefan
>===== Original Message From Sendu Bala =====
>Sean Davis wrote:
>>
>> On 7/27/06 5:55 AM, "Sendu Bala" wrote:
>>
>>> Maximilian Haeussler wrote:
>>> Actually, the fact that the transfac matrices belong to a
>>> company is quite inconvenient for biologists and bioinformatics
>>> analysts working in this field.
>>>
>>> The public version is adequate though. It would certainly be useful to
>>> have Bioperl access to transfac and other regulation databases. I'll
>>> probably write some suitable modules if no one beats me to it.
>>
>> I haven't used it in a while, but the TFBS family of modules are, if I
>> recall correctly, bioperl-compatible, though not part of bioperl. In any
>> case, for those who aren't aware, it might be worth a quick look:
>
>Yes. It only has online access to Transfac though, and the inheritance
>and returned objects are TFBS specific so you miss out on whatever
>goodness there may be in the rest of bioperl.
>
>Still, recommended to use if you want programmatic access to Transfac
>matrices right now.
From bix at sendu.me.uk Thu Jul 27 12:30:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 17:30:38 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk>
Message-ID: <44C8EA2E.8030000@sendu.me.uk>
Hilmar Lapp wrote:
> How do you mean 'process'? You create a new module, and then you
> deprecate the ones you're phasing out. If possible you rewrite the
> implementation to use the new module.
>
> Not sure this answers your question?
I guess. I was thinking of just making Bio::Taxonomy::Node isa
Bio::Taxon and then simply removing all the code from Node, leaving just
some perldoc that said it had been renamed?
Or should there be some methods that issue a warning and then call SUPER?
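To make the second option concrete, here is a minimal sketch of a shim
class that warns and then defers to its parent. It is not actual Bioperl
code: My::Taxon and My::Taxonomy::Node are hypothetical stand-ins for
Bio::Taxon and Bio::Taxonomy::Node, and the warning uses the core Carp
module rather than any Bioperl facility.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the new class (the proposed Bio::Taxon).
package My::Taxon;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{-name} }, $class;
}

sub name { return $_[0]->{name} }

# Hypothetical stand-in for the old class, reduced to a thin shim:
# it inherits everything and only adds a deprecation warning.
package My::Taxonomy::Node;

use Carp qw(carp);
our @ISA = ('My::Taxon');

sub new {
    my ($class, @args) = @_;
    # carp reports the warning from the caller's point of view, so the
    # user sees which line of their own script needs changing.
    carp "$class is deprecated; use My::Taxon instead";
    return $class->SUPER::new(@args);
}

package main;

# Old code keeps working, but prints a warning to STDERR.
my $node = My::Taxonomy::Node->new(-name => 'Homo sapiens');
print $node->name, "\n";
```

With this layout only the constructor needs overriding; every other
method is inherited unchanged, so the old module really does hold nothing
but the warning and its perldoc.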
From hlapp at gmx.net Thu Jul 27 12:38:30 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 27 Jul 2006 12:38:30 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8EA2E.8030000@sendu.me.uk>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk>
<44C8EA2E.8030000@sendu.me.uk>
Message-ID: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net>
That's what I said could be possible here on much shorter notice than
we'd usually allow, due to the low usage.
Eventually deprecated modules should also be physically removed, so
you want to prepare for that. (Removing a module breaks scripts that
use it; issuing a warning alerts users that removal is forthcoming.)
-hilmar
On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>> How do you mean 'process'? You create a new module, and then you
>> deprecate the ones you're phasing out. If possible you rewrite the
>> implementation to use the new module.
>>
>> Not sure this answers your question?
>
> I guess. I was thinking of just making Bio::Taxonomy::Node isa
> Bio::Taxon and then simply removing all the code from Node, leaving
> just
> some perldoc that said it had been renamed?
>
> Or should there be some methods that issue a warning and then call
> SUPER?
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From sanges at biogem.it Thu Jul 27 12:37:05 2006
From: sanges at biogem.it (Remo Sanges)
Date: Thu, 27 Jul 2006 18:37:05 +0200
Subject: [Bioperl-l] TRANSFAC matrices, open acces
In-Reply-To: <44E2E794@webmail.utk.edu>
References: <44E2E794@webmail.utk.edu>
Message-ID: <44C8EBB1.5070709@biogem.it>
Here is also my 2c on TFBS:
skirov wrote:
>Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get
>it- and as far as I can tell this is not easy- you have to contact the company
>to get access and it is not clear what their conditions are. This is the
>reason I have decided not to maintain the transfac parser.
>Stefan
>
>
>>===== Original Message From Sendu Bala =====
>>Sean Davis wrote:
>>
>>
>>>On 7/27/06 5:55 AM, "Sendu Bala"