[Bioperl-l] Re: PARSE GENBANK SEQUENCE FILES (Melissa L. Kimball)

darin.m.london@gsk.com darin.m.london@gsk.com
Tue, 11 Jun 2002 06:21:14 -0400


                                                                                                                     
                    bioperl-l-request@b                                                                              
                    ioperl.org                                                                                       
                                                                                                                     
                    Sent by:                  To:     bioperl-l                                                      
                    bioperl-l-admin@bio                                                                              
                    perl.org                  cc:                                                                    
                                              Subject:     Bioperl-l digest, Vol 1 #760 - 10 msgs                    
                                                                                                                     
                    11-Jun-2002 04:21                                                                                
                    Please respond to                                                                                
                    bioperl-l@bioperl.o                                                                              
                    rg                                                                                               
                                                                                                                     
                                                                                                                     







Send Bioperl-l mailing list submissions to
           bioperl-l@bioperl.org

To subscribe or unsubscribe via the World Wide Web, visit
           http://bioperl.org/mailman/listinfo/bioperl-l
or, via email, send a message with subject or body 'help' to
           bioperl-l-request@bioperl.org

You can reach the person managing the list at
           bioperl-l-admin@bioperl.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioperl-l digest..."


Today's Topics:

   1. PARSE GENBANK SEQUENCE FILES (Melissa L. Kimball)
   2. Please help me with the parameters for precise alignment (ze cheng)
   3. Re: PARSE GENBANK SEQUENCE FILES (Jason Stajich)
   4. Re: PARSE GENBANK SEQUENCE FILES (Jason Stajich)
   5. Re: PARSE GENBANK SEQUENCE FILES (Stefan Kirov)
   6. Retrieving genomic sequence remotely (Stefan Kirov)
   7. Genbank seq CODE (Melissa L. Kimball)
   8. Re: Genbank seq CODE (Jason Stajich)
   9. 1.0.1 preparing to release, 1.1 plans (Jason Stajich)
  10. Re: 1.0.1 preparing to release, 1.1 plans (Heikki Lehvaslaiho)

--__--__--

Message: 1
Date: Mon, 10 Jun 2002 12:40:39 -0400
From: "Melissa L. Kimball" <mkimball@med.unc.edu>
To: <bioperl-l@bioperl.org>
Subject: [Bioperl-l] PARSE GENBANK SEQUENCE FILES

I am very new to bioperl.

I need to selectively convert *.seq files from Genbank into fasta format.
Bio::SeqIO has this capability.  Though, I am having a problem writing the
sequence to a local file.  Here is the error:


>Filehandle GEN1 opened only for input at /Library/Perl/Bio/Root/IO.pm line
305,
>        <GEN0> line 1174524 (#1)
>    (W io) You tried to write on a read-only filehandle.  If you intended
it
>    to be a read-write filehandle, you needed to open it with "+<" or "+>"
>    or "+>>" instead of with "<" or nothing.  If you intended only to
write
>    the file, use ">" or ">>".  See perlfunc/open.


I have looked in the IO.pm module to fix the problem on line 305, but I
don't know perl well enough to interpret what it is doing.  This is what is
on line 305:


>sub _print {
>    my $self = shift;
>    my $fh = $self->_fh || \*STDOUT;
>>>  print $fh @_;   <<<<---------------THIS IS LINE 305<<<<<<<------------


The other thing is, that I don't want to convert the entire *.seq file into
fasta.  I need to be selective.  The conditions lie within the annotation
part of the sequence.  I have been able to acquire the species by using:

>Bio::SeqIO->next_seq()  //RETURNS RichSeq
>RichSeq->species()      //RETURNS Species
>Species->species()      //RETURNS String

But I also need to retrieve sequences based on other information in the
annotation.  The methods provided in SeqIO, Seq, Annotation, Species, and
so
forth are not enough!  The values returned from the methods are of no use
for me.  Maybe I am not implementing them properly, I am only using the
doc.bioperl.org documentation site as my guide.  Are there any other
modules
that parse genbank *.seq files, so that I can obtain any part of the
annotation of a sequence?

SPECS
-----
OS = Mac OSX
MACHINE = POWERBOOK
PERL = v. 5.8.0
BIOPERL = 1.0 stable (I think)

Thanks for all the help in advance ;-)

.........................................................
Melissa Kimball
University of North Carolina
Giddings Bioinformatics Group - Microbiology & Immunology
mkimball@med.unc.edu
.........................................................


Here is a script I wrote to split a directory of zipped genbank files, and
parse them into a directory of genbank files and a directory of fasta
files, each with splits based on whether the seqs are genomic, non-genomic,
and est X species.  It also writes out the fasta files with a different
headerline than the standard bioperl defined header.  I offer it as a
demonstration of many of the features of the SeqIO - Seq system.

sub keyWordParse {

  return sub {
      my $keyword = shift;

      return speciesParse($outHASH{GENOMIC}) if ($keyword =~ m/HTG|GSS/);
      return speciesParse($outHASH{EST}) if ($keyword =~ m/EST/);
      return speciesParse($outHASH{NONGENOMIC});
  }
}

sub speciesParse {
  my $outHashWithKeyword = shift;

  return sub {
        my $species = shift;

        return ( defined( $outHashWithKeyword->{uc($species)} ) )?writeSeq
($outHashWithKeyword->{uc($species)}):writeSeq
($outHashWithKeyword->{OTHER});
  }
}

sub writeSeq {
 my $outHashFullyQualified = shift;

 return sub {
      my $seqOBJ = shift;

      my $displayid = sprintf("%s|%s %s [%s]", $seqOBJ->accession_number,
$seqOBJ->accession_number, $seqOBJ->desc, $seqOBJ->species->binomial);
      my $pseq = Bio::PrimarySeq->new( '-seq' => $seqOBJ->seq, '
-display_id' => $displayid);

      $outHashFullyQualified->{GENBANK}->write_seq($seqOBJ);
      $outHashFullyQualified->{FASTA}->write_seq($pseq);
   }
}


### MAIN

use strict;
use Getopt::Long;
use Bio::SeqIO;
use IO::File;

my $sourcedir;
my $fastadir;
my $gbdir;
my $usage = "usage:\ngbsplitter.pl -sourcedir
path_to_dir_with_genbank_files -fastadir path_to_dir_for_fasta -genbankdir
path_to_dir_for_genbank_files\n\nProgram writes out human, mouse, rat, and
other X est genomic and nongenomic splits in fastadir and genbankdir\n";

&GetOptions( 'sourcedir:s' => \$sourcedir,
             'fastadir:s' => \$fastadir,
             'genbankdir:s' => \$gbdir
           );

unless ($sourcedir and $fastadir and $gbdir) {
      die $usage;
    }

# global variable outHASH, which allows the closures above to reference the
hash.
our %outHASH = (
           GENOMIC => {(
                 qq/HOMO SAPIENS/ => {(
                    GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_genomic_human", '-format' => 'GenBank'),
                                FASTA => Bio::SeqIO->new('-file' => ">
${fastadir}\/genbank_genomic_human", '-format' => 'fasta')
                  )},
                  qq/RATTUS NORVEGICUS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_genomic_rat", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_genomic_rat", '-format' => 'fasta')
                  )},
               qq/MUS MUSCULUS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_genomic_mouse", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_genomic_mouse", '-format' => 'fasta')
                  )},
            qq/OTHER/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_genomic_other", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_genomic_other", '-format' => 'fasta')
                  )}
            )},
            NONGENOMIC => {(
                 qq/HOMO SAPIENS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_nongenomic_human", '-format' => 'GenBank'),
                        FASTA =>  Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_nongenomic_human", '-format' => 'fasta')
           )},
                 qq/RATTUS NORVEGICUS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_nongenomic_rat", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_nongenomic_rat", '-format' => 'fasta')
           )},
                 qq/MUS MUSCULUS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_nongenomic_mouse", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_nongenomic_mouse", '-format' => 'fasta')
              )},
                 qq/OTHER/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_nongenomic_other", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_nongenomic_other", '-format' => 'fasta')
           )}
            )},
        EST => {(
                 qq/HOMO SAPIENS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_est_human", '-format' => 'GenBank'),
                        FASTA =>  Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_est_human", '-format' => 'fasta')
           )},
                 qq/RATTUS NORVEGICUS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\genbank_est_rat", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_est_rat", '-format' => 'fasta')
           )},
                 qq/MUS MUSCULUS/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_est_mouse", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_est_mouse", '-format' => 'fasta')
              )},
                 qq/OTHER/ => {(
                        GENBANK => Bio::SeqIO->new('-file' => ">${gbdir}
\/genbank_est_other", '-format' => 'GenBank'),
                        FASTA => Bio::SeqIO->new('-file' => ">${fastadir}
\/genbank_est_other", '-format' => 'fasta')
           )}
            )}
);

my $compressed = 0;  # when a sourcefile is compressed, this is true.
my $seqOUT = keyWordParse(); # create the closure

my @sourcefiles = `ls ${sourcedir}/*`;

foreach my $sourcefile (@sourcefiles) {
   chomp $sourcefile;
   my $file = ($sourcefile =~ m/.*\.gz$/)?"/pub/bin/gunzip -c $sourcefile
|":"<$sourcefile";
   my $inseqio = Bio::SeqIO->new('-file' => $file, '-format' => 'GenBank');

   while (my $inseq = $inseqio->next_seq ) {
     $seqOUT->($inseq->keywords)->($inseq->species->binomial)->($inseq);  #
one line does it all, with the closure in place.
   }
}
print "Parse of $sourcedir complete, see $gbdir for genbank files, and
$fastadir for fasta files\n";
exit;