[Bioperl-l] need help ??parse AcNum from fasta?

Nathan S. Haigh n.haigh at sheffield.ac.uk
Tue Oct 2 09:56:49 UTC 2007


outaleb Issame wrote:
> hi,
> with this file i mean, i picked out this Accession Number from
> IPI-Human Dbase,they come from a fasta file,
> so they re under eachother like a i a table in separate file now.
> what i want is how how can i check it in the fasta File (so in the
> IPI-Human FAsta File), i they re really there;
> if yes please copy the entire entry of this Number (>....the sequence
> also)in new fasta file.so that i get at the end a new
> FASTA file with jus this IPI Accession Number.
> thx and hope was clearly.

Ok, first of all, I'd read the contents of your Accession numbers into a
hash, something like the following (this could be written in a shorter
form, but since you're a newbie I'll leave it in a longer form so you
can follow easier).

-- start script --
use strict;
use Bio::SeqIO;

# change the following three lines to point to the relevant paths
# of your list of accessions file, your fasta file and your output
# fasta file
my $acc_file = "/path/to/your/file";
my $fasta_file_in = "/path/to/your/fasta/file";
my $fasta_file_out = "/path/to/your/fasta/output/file";

# Use a hash to keep a record of accessions we want to find
my %hash_of_req_acc;

# read all the required accessions from the file into the hash as keys
open (ACC_FILE, $acc_file) or die "Couldn't open file: $!\n";
while (<ACC_FILE>) {
  my $line = $_;
  chomp $line;
  $hash_of_req_acc{$_} = 1;
}
close ACC_FILE;

my $seqio_object_in = Bio::SeqIO->new(
  -file => $fasta_file_in,
  -format => 'fasta'
);
my $seqio_object_out = Bio::SeqIO->new(
  -file => $fasta_file_out,
  -format => 'fasta'
);

# loop through all the sequences in the fasta file
while (my $seq_object = $seqio_object_in->next_seq) {
  # get the sequence accession for easy matching
  my $seq_acc = $seq_object->accession_number;

  # write the sequence object to the output fasta file if we have a
matching accession
  $seqio_object_out->write_seq($seq_object) if exists
$hash_of_req_acc{$seq_acc};
}
-- end script --

I haven't tested this, but it should at least get you started. Also, the
fasta description line in the output file may not be exactly as it was
in the input fasta file - if this really matters, you may need to get
back to us. Also, if the input fasta file is huge (many thousands of
sequences) it may be wise to create an index of the fasta file in order
to speed up retrieval.

You may find this page helpful:
http://www.bioperl.org/wiki/HOWTO:SeqIO

Anyway, hope this helps to get you started.
Nath





More information about the Bioperl-l mailing list