[Bioperl-l] PARSE GENBANK SEQUENCE FILES

Jason Stajich jason@cgt.mc.duke.edu
Mon, 10 Jun 2002 13:15:24 -0400 (EDT)


On Mon, 10 Jun 2002, Melissa L. Kimball wrote:

> I am very new to bioperl.
>
> I need to selectively convert *.seq files from Genbank into fasta format.
> Bio::SeqIO has this capability.  Though, I am having a problem writing the
> sequence to a local file.  Here is the error:
>
We would need to see some code here to help you better - sounds like you
are not properly initializing your output handle for the fasta writer.

Here is a simple script that pulls in all the *.seq files and converts to
fasta.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

opendir(IN, "."); # opendir
foreach my $file ( readdir(IN) ) {
 next unless (/(\S+)\.seq$/);
 my $input = new Bio::SeqIO(-format =>'genbank',
			    -file   => $file);
 my $output = new Bio::SeqIO(-file => ">$1.fasta",
			     -format => 'fasta');
 while( my $seq = $input->next_seq) {
  $output->write_seq($seq);
 }
}

All done.
Now if you wanted to only print out sequences that match a criteria, like
perhaps were of a certain species as you allude to below you could augment
the while loop above like this
  while( my $seq = $input->next_seq ){
   next unless ( $seq->species->


>
> >Filehandle GEN1 opened only for input at /Library/Perl/Bio/Root/IO.pm line 305,
> >        <GEN0> line 1174524 (#1)
> >    (W io) You tried to write on a read-only filehandle.  If you intended it
> >    to be a read-write filehandle, you needed to open it with "+<" or "+>"
> >    or "+>>" instead of with "<" or nothing.  If you intended only to write
> >    the file, use ">" or ">>".  See perlfunc/open.
>
>
> I have looked in the IO.pm module to fix the problem on line 305, but I
> don't know perl well enough to interpret what it is doing.  This is what is
> on line 305:
>
>
> >sub _print {
> >    my $self = shift;
> >    my $fh = $self->_fh || \*STDOUT;
> >>>  print $fh @_;   <<<<---------------THIS IS LINE 305<<<<<<<------------
>
>
> The other thing is, that I don't want to convert the entire *.seq file into
> fasta.  I need to be selective.  The conditions lie within the annotation
> part of the sequence.  I have been able to acquire the species by using:
>
> >Bio::SeqIO->next_seq()  //RETURNS RichSeq
> >RichSeq->species()      //RETURNS Species
> >Species->species()      //RETURNS String
>

Read the documentation for the Bio::SeqI and Bio::Seq::RichSeqI methods,
you'll see that you are getting back a Bio::Species object - read the docs
forBio::Species to see how to get the whole species
name/genus/classification.

What other criteria do you want to select on to limit your output?  All
the data represented in a genbank file is accessible through the object
methods for a Bio::Seq::RichSeqI (which inherits from Bio::SeqI - so also
all the methods of a Bio::SeqI).

Sequence length - use $seq->length()

Reference annotation -
my @refs =
$seq->annotation->get_Annotation('reference');
next unless grep { $_->author =~ /MY FAVORITE AUTHOR/ } @refs;


Dates?   - my @dates = $seq->get_dates();
           next unless grep {/2002/} @dates;
Keywords - my @keywords = $seq->keywords();
	   next unless grep {/HIV/} @keywords;


> But I also need to retrieve sequences based on other information in the
> annotation.  The methods provided in SeqIO, Seq, Annotation, Species, and so
> forth are not enough!  The values returned from the methods are of no use
> for me.  Maybe I am not implementing them properly, I am only using the
> doc.bioperl.org documentation site as my guide.  Are there any other modules
> that parse genbank *.seq files, so that I can obtain any part of the
> annotation of a sequence?
>
there is only one parser - you just need to drill down to the data that is
attatched to the sequence.  If you are interested in only dumping
sequences which have certain feature annotated -
 my @features = $seq->all_SeqFeatures();
 next unless grep { $_->primary_tag =~ /exon/ } @features;


All the best - feel free to ask for better instructions if this doesn't
help.


 > SPECS
> -----
> OS = Mac OSX
> MACHINE = POWERBOOK
> PERL = v. 5.8.0
> BIOPERL = 1.0 stable (I think)
>
> Thanks for all the help in advance ;-)
>
> .........................................................
> Melissa Kimball
> University of North Carolina
> Giddings Bioinformatics Group - Microbiology & Immunology
> mkimball@med.unc.edu
> .........................................................
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu