Bioperl: Any non-redundant database tools out there ???
Steve A. Chervitz
sac@alberich.Stanford.EDU
Fri, 28 Aug 1998 04:07:02 -0700 (PDT)
Gordon,
I have some code that you may find useful. It was from an experiment
to test Jarkko Hietaniemi's String::Approx.pm for use with biosequences
(generally it works pretty well, but is a little buggy). It can also
cluster all unique sequences from a set (use the -noneighb option). It reads
Fasta-formatted sequences only.
http://genome-www.stanford.edu/perlOOP/bioperl/bin/cluster_seq.pl
This script requires some modules that are included with my Blast
distribution, as well as String::Approx.pm from CPAN (if you want to do
approximate matching).
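
For the exact-match case only, the idea of grouping identical sequences
can be sketched roughly as below. This is a minimal illustration in Python,
not code taken from cluster_seq.pl, and the names read_fasta and
group_identical are made up for the example. It assumes that keying on a
fixed-length digest of each sequence (SHA-1 here) is an acceptable stand-in
for using the sequence itself as the hash key, which keeps the keys short
no matter how long the sequences are.

```python
import hashlib
from collections import defaultdict


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of Fasta-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first word after '>'.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def group_identical(records):
    """Map a fixed-length digest of each sequence to the list of IDs sharing it."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        # Upper-casing is an assumption: case differences are treated as irrelevant.
        key = hashlib.sha1(seq.upper().encode("ascii")).hexdigest()
        groups[key].append(seq_id)
    return dict(groups)


fasta = """>seq1 example
MKTAYIAK
QRQISFVK
>seq2
mktayiakqrqisfvk
>seq3
MKTAYIAKQRQISFVA
"""
groups = group_identical(read_fasta(fasta.splitlines()))
clusters = sorted(groups.values())
print(clusters)  # [['seq1', 'seq2'], ['seq3']]
```

Two different sequences could in principle share a digest, so if that
matters the sequences themselves should be compared before IDs are merged.
The in-memory dictionary could presumably be swapped for an on-disk
key/value store when the data set does not fit into memory.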
I'd be interested in any feedback you might have if you try it out.
Steve Chervitz
sac@genome.stanford.edu
On Thu, 27 Aug 1998, Ewan Birney wrote:
>
> Gordon posted this to 'guts' but it seems much more
> appropriate to post to the main mailing list, hence I am
> forwarding it.
>
>
>
> Ewan Birney
> <birney@sanger.ac.uk>
> http://www.sanger.ac.uk/Users/birney/
>
> ---------- Forwarded message ----------
> Date: Thu, 27 Aug 1998 11:11:50 -0500
> From: Gordon D. Pusch <pusch@mcs.anl.gov>
> To: vsns-bcd-perl-guts@lists.uni-bielefeld.de
> Subject: Bioperl-guts: Any non-redundant database tools out there ???
>
> Hi --- I am trying to construct a ``non-redundant'' version of WIT's
> sequence database. An obvious stupid-but-simple way to do this would
> be to use the sequence itself as the key to a hash of ID lists.
>
> However, since there are a LOT of sequences, the whole thing obviously
> won't fit into memory and we will have to store the hash as a Berkeley-DB;
> and of course, some of the sequences are quite long. I worry about such
> enormously long keys ``breaking'' something in either perl5 or Berkeley-DB's
> hash routines --- I gather they are stored internally as B-trees, so I
> could easily imagine very long keys producing stack-overflows during a
> tree traversal if the trees got too deep... :-(
>
> Has anyone on this list implemented a non-redundant database-builder
> in perl ???
>
> Does anyone know if there =IS= a limit as to how long a hash-key
> can be for either perl5 or Berkeley-DB ??? If so, what are the usual
> failure-modes ???
>
> Can anyone suggest a more elegant algorithm than the ``stupid-but-simple''
> method outlined above ???
>
>
> Thanks in advance,
>
> -- Gordon D. Pusch <pusch@mcs.anl.gov>
>
> Disclaimer: I'm a consultant collaborating with Argonne researchers;
> I don't speak for ANL or the DOE --- and they *certainly* don't speak
> for =ME= !!!
>
> Claimer: I report =ALL= SPAMvertisers to their ISP --- =NO= exceptions !!!
>
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl-guts.html
> ====================================================================
>
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
>