[Bioperl-l] Getting sequences by ID
Ryan Golhar
golharam at umdnj.edu
Thu Apr 6 15:34:57 UTC 2006
Here's how I'm doing it with bioperl, but with large genbank files (such
as chromosomes) it takes a while:
use Bio::SeqIO;
my $inseq = Bio::SeqIO->new(...);
while (my $seqobj = $inseq->next_seq) {
    # skip every record whose accession does not match the wanted ID
    next if $seqobj->accession_number ne $id;
    # process the sequence here
}
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Torsten
Seemann
Sent: Wednesday, April 05, 2006 6:14 PM
To: Yuval Itan
Cc: bioperl-l at bioperl.org
Subject: Re: [Bioperl-l] Getting sequences by ID
On Wed, 2006-04-05 at 18:03 +0100, Yuval Itan wrote:
> I would be grateful for advice from you regarding Bioperl, after I was
> fiddling around trying to write the Perl script for that from scratch.
> I have a large fasta file of about 20,000 genes, and another file which
> is a list of about 2,000 gene IDs (no sequences), all included in the
> large file. I need to create a fasta file which will include only the
> genes with these specific 200 IDs. I was wondering if there is a method
> in Bioperl that will allow me to do the following pseudocode:
>
> For each $ID from 200_IDs_set_file
> {
> $my_seq = get_sequence_by_ID(from large_fasta_file, $ID)
> write $my_seq into file
> }
There are many possibilities involving combinations of pure Perl and
BioPerl modules, and some even involving no Perl, but rather using
commands like 'formatdb' and 'fastacmd -s'. There are probably EMBOSS
solutions too.
Using your pseudo code, you could use Bio::Index::Fasta to index your
20,000 genes. Then loop over each ID, and retrieve the Seq via the
index, and write it out using Bio::SeqIO.
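A minimal sketch of that index-based route, for illustration only. The
sub name extract_by_index, its arguments, and the file names in the usage
comment are made up for this example; it also assumes the IDs in the list
match the first word after '>' in each FASTA header, which is what
Bio::Index::Fasta uses as its key by default.

```perl
use strict;
use warnings;
use Bio::Index::Fasta;
use Bio::SeqIO;

# Hypothetical helper: index $fasta_file, then write every sequence whose
# ID appears in @$ids to $out_file. Returns the number of sequences written.
sub extract_by_index {
    my ($fasta_file, $index_file, $ids, $out_file) = @_;

    # Build (or update) the on-disk index of the large FASTA file.
    my $inx = Bio::Index::Fasta->new(-filename   => $index_file,
                                     -write_flag => 1);
    $inx->make_index($fasta_file);

    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => 'fasta');
    my $written = 0;
    for my $id (@$ids) {
        # eval guards against a fetch that throws for an unknown ID
        # instead of returning undef
        my $seq = eval { $inx->fetch($id) };
        next unless defined $seq;
        $out->write_seq($seq);
        $written++;
    }
    $out->close;
    return $written;
}

# Usage (file names are placeholders):
#   my $n = extract_by_index('genes.fa', 'genes.idx', \@ids, 'subset.fa');
```

The index only pays off if the same large file is queried more than once;
for a one-off extraction a single pass over the file is usually simpler.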
Perhaps look at it from another perspective:
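use Bio::SeqIO;
# put all the IDs we want into a hash (read from file)
my %want_id = .... ;
my $in  = Bio::SeqIO->new(-file => 'large_fasta_file', -format => 'fasta');
# STDOUT is used here for brevity; an output file works the same way
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
while (my $seq = $in->next_seq) {
    # write out only the sequences whose ID is in the hash
    $out->write_seq($seq) if $want_id{$seq->id};
}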
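
For comparison, the same single-pass filter can be sketched in pure Perl
with no BioPerl at all. This is only an illustration: the sub names
read_ids and filter_fasta are made up here, and it assumes one ID per line
in the list file and that each ID equals the first word after '>' in the
FASTA header.

```perl
use strict;
use warnings;

# Read one ID per line from a filehandle into a lookup hash.
sub read_ids {
    my ($fh) = @_;
    my %want;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;          # trim whitespace and line endings
        $want{$line} = 1 if length $line; # ignore blank lines
    }
    return \%want;
}

# Copy every FASTA record whose ID is in %$want from $in_fh to $out_fh.
# Returns the number of records written.
sub filter_fasta {
    my ($in_fh, $out_fh, $want) = @_;
    my ($keep, $written) = (0, 0);
    while (my $line = <$in_fh>) {
        if ($line =~ /^>/) {
            # header line: decide whether to keep this whole record
            my ($id) = $line =~ /^>\s*(\S+)/;
            $keep = defined $id && exists $want->{$id};
            $written++ if $keep;
        }
        print {$out_fh} $line if $keep;
    }
    return $written;
}

# Usage (file names are placeholders):
#   open(my $ids, '<', 'ids.txt')  or die $!;
#   open(my $in,  '<', 'genes.fa') or die $!;
#   filter_fasta($in, \*STDOUT, read_ids($ids));
```

Because the hash lookup is constant time, the cost is one read of the
large file regardless of how many IDs are in the list.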
--
Torsten Seemann <torsten.seemann at infotech.monash.edu.au>
Victorian Bioinformatics Consortium