[Bioperl-l] Getting sequences by ID
Torsten Seemann
torsten.seemann at infotech.monash.edu.au
Wed Apr 5 22:14:01 UTC 2006
On Wed, 2006-04-05 at 18:03 +0100, Yuval Itan wrote:
> I would be grateful for an advice from you regarding Bioperl, after I was
> fiddling around trying to write the Perl script for that from scratch.
> I have a large fasta file of about 20,000 genes, and another file which is a
> list of about 2,000 gene IDs (no sequences), all included in the large file.
> I need to create a fasta file which will include only the genes with these
> specific 200 IDs. I was wondering if there is a method in Bioperl that will
> allow me to do the following pseudocode:
>
> For each $ID from 200_IDs_set_file
> {
> $my_seq = get_sequence_by_ID(from large_fasta_file, $ID)
> write $my_seq into file
> }
There are many possibilities involving combinations of pure Perl and
BioPerl modules, and some even involving no Perl, but rather using
commands like 'formatdb' and 'fastacmd -s'. There are probably EMBOSS
solutions too.
Using your pseudo code, you could use Bio::Index::Fasta to index your
20,000 genes. Then loop over each ID, and retrieve the Seq via the
index, and write it out using Bio::SeqIO.
Perhaps look at it from another perspective:
# put all the IDs we want into a hash (read from file)
my %want_id = .... ;
foreach $seq (use Seq::IO to read large_fasta_file) {
if $want_id{$seq->id} then
use Seq::IO to write this $seq out
end
}
--
Torsten Seemann <torsten.seemann at infotech.monash.edu.au>
Victorian Bioinformatics Consortium
More information about the Bioperl-l
mailing list