[Bioperl-l] getting database hit sequences
Ewan Birney
birney@ebi.ac.uk
Mon, 21 Oct 2002 07:56:41 +0100 (BST)
On Mon, 21 Oct 2002, Tobias Thierer wrote:
> Hi,
>
> What is the best (or at least some) way to get the entire sequence? The
> annotation of the database hits ($hsp->query->name or so) contains
> substrings of the form "/gi=some_gi_number", so I could perhaps extract
> the GI with a regexp. But how do I get the right sequence for a specific
> GI number efficiently? Parsing the entire database for every hit is
> O(num_sequences^2) and therefore much too slow to be feasible.
Use one of the Bio::DB or Bio::Index classes. Here are your options:

- If you have your own local database, use Bio::Index::Fasta to build
  a local index of it (the Bio::Index::Fasta docs explain how to create
  an index). Store the index on a disk that all your clients can see
  (i.e., it is often good to NFS mount this).

- If you are working against EMBL, GenBank or Swiss-Prot, use
  Bio::DB::EMBL, Bio::DB::GenBank or Bio::DB::SwissProt. These work
  across the network and so can be pretty darn slow. If you use
  Bio::DB::SwissProt, make sure you point it to the nearest ExPASy
  mirror.

- Use Bio::DB::FileCache and Bio::DB::InMemoryCache to improve
  performance of the clients and cut down on the number of trips to the
  network.
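
For the local-database case, something along these lines should work. It is a rough sketch, not tested against your data: it assumes the FASTA header lines in your database carry the same "/gi=NNN" tag that you see in the hit annotation, and the sub names (extract_gi, build_gi_index, fetch_for_hit) and the file names are made up for illustration. The Bio::Index::Fasta calls (new, id_parser, make_index, fetch) are the ones described in that module's documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use Bio::Index::Fasta;

# Pull the GI number out of any text containing "/gi=NNN".
# Returns the digits; returns nothing (undef in scalar context)
# when the text has no such tag.
sub extract_gi {
    my ($text) = @_;
    return $text =~ m{/gi=(\d+)} ? $1 : ();
}

# Build the index once. Afterwards each lookup is a keyed fetch
# rather than a scan of the whole database.
sub build_gi_index {
    my ($index_file, @fasta_files) = @_;
    my $inx = Bio::Index::Fasta->new(
        -filename   => $index_file,
        -write_flag => 1,
    );
    # Key each record by its GI rather than by the default ID
    # (assumes the headers contain "/gi=NNN").
    $inx->id_parser(\&extract_gi);
    # Absolute paths are the safe choice when clients run from
    # different working directories.
    $inx->make_index(map { File::Spec->rel2abs($_) } @fasta_files);
    return;
}

# Given an index opened for reading and the annotation text of a hit,
# return the full sequence object, or nothing if no GI was found.
sub fetch_for_hit {
    my ($inx, $hit_text) = @_;
    my $gi = extract_gi($hit_text);
    return unless defined $gi;
    return $inx->fetch($gi);
}

# Example with hypothetical file names:
#   build_gi_index('/data/mydb.idx', '/data/mydb.fasta');
#   my $inx = Bio::Index::Fasta->new(-filename => '/data/mydb.idx');
#   my $seq = fetch_for_hit($inx, $hit_annotation);
#   print $seq->seq, "\n" if $seq;
```

Open the index once per client and reuse the object for every hit; reopening it for each lookup would throw away most of the gain.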
>
> Is there any possibility to easily get the entire sequences that formed
> HSPs with my query sequence, preferably with Bioperl?
>
> Any help would be greatly appreciated!
>
> Regards,
>
> Tobias
>