[Bioperl-l] Bug in SeqIO genbank output

Heikki Lehvaslaiho heikki at nildram.co.uk
Fri Jan 2 13:13:00 EST 2004


On Friday 02 Jan 2004 1:16 am, Jason Stajich wrote:
> The reason Heikki was not seeing the problems is probably because he was
> doing roundtripping with a genbank file - if you start with embl or fasta
> you see that the trailing 6 spaces aren't coming in.  This is because of

So that's what it was! I was using a genbank file. Thanks, Jason,

	-Heikki
 
> this code
> (parsing of genbank)
> !	  if(defined($_) && s/^ORIGIN//) {
> 	      chomp;
> 	      if( $annotation && length($_) > 0 ) {
> 		  $annotation->add_Annotation('origin',
> 					       Bio::Annotation::SimpleValue->new(-value => $_));
> 	      }
>         changing this to
> !         if(defined($_) && s/^ORIGIN\s+//) {
>
> So the $o value in the ORIGIN writer was getting set with the 6 spaces
> when inputting the genbank file.  This is a silly thing to store as an
> annotation.
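
To make the effect of that one-character-class change concrete, here is a minimal sketch in plain Perl. It does not use genbank.pm; the helper name origin_remainder is hypothetical and exists only for this illustration. It applies the old and new substitutions to an ORIGIN line that carries six trailing spaces, as NCBI records do.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of genbank.pm): strip the ORIGIN keyword
# with the given pattern, then chomp, and return whatever is left over.
sub origin_remainder {
    my ($line, $pattern) = @_;
    $line =~ s/$pattern//;
    chomp $line;
    return $line;
}

# ORIGIN followed by six spaces and a newline, as in records from NCBI.
my $line = "ORIGIN      \n";

my $old = origin_remainder($line, qr/^ORIGIN/);      # pattern before the patch
my $new = origin_remainder($line, qr/^ORIGIN\s+/);   # pattern after the patch

printf "old: [%s] (length %d)\n", $old, length $old;
printf "new: [%s] (length %d)\n", $new, length $new;
```

With the old pattern the six spaces survive as the leftover value, which is what ended up stored in the 'origin' annotation; with the new pattern nothing is left, so no annotation would be added.
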
>
> So fixing the ORIGIN problem
> Index: Bio/SeqIO/genbank.pm
> ===================================================================
> RCS file: /home/repository/bioperl/bioperl-live/Bio/SeqIO/genbank.pm,v
> retrieving revision 1.99
> diff -r1.99 genbank.pm
> 543c543
> < 	  if(defined($_) && s/^ORIGIN//) {
> ---
> > 	  if(defined($_) && s/^ORIGIN\s+//) {
> 819c819,820
> < 	$self->_print(sprintf("%-6s%s\n",'ORIGIN',$o ? $o->value : ''));
> ---
> > 	$self->_print(sprintf("%-12s%s\n",
> > 			      'ORIGIN', $o ? $o->value : ''));
>
> I also exposed an EMBL parsing bug that occurs when there is no feature
> table; I fixed it on the main trunk and also merged the fix onto the branch.
>
>
> Happy New Year.
> --jason
>
> On Fri, 2 Jan 2004, Wes Barris wrote:
> > Heikki Lehvaslaiho wrote:
> > > Wes,
> > >
> > > You did not say which version of bioperl you are using. For some reason
> >
> > I am using bioperl-1.2.3
> >
> > > which I cannot quite understand, the current code:
> > >           $self->_print(sprintf("%-6s%s\n",'ORIGIN',$o ? $o->value : ''));
> > >
> > > does print out the required six spaces after the word ORIGIN. This was
> > > recently
> >
> > Really?  How?  In the above line "%-6s" left justifies 'ORIGIN' (which is
> > already 6 characters).  The '6' needs to be changed to '12' to get six
> > extra spaces.  See below.
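
The width arithmetic above can be checked with a short standalone sketch. It assumes an empty origin value (the case when the record did not come from a GenBank file) and uses only core sprintf; the variable names are illustrative and not taken from genbank.pm.

```perl
use strict;
use warnings;

# Assumption: the 'origin' annotation is empty, as when the input was
# EMBL or FASTA rather than GenBank.
my $origin_value = '';

my $narrow = sprintf("%-6s%s\n",  'ORIGIN', $origin_value);
my $wide   = sprintf("%-12s%s\n", 'ORIGIN', $origin_value);

# Count the spaces between the keyword and the newline in each line.
for my $line ($narrow, $wide) {
    my ($pad) = $line =~ /^ORIGIN( *)\n/;
    printf "%d trailing spaces\n", length $pad;
}
```

This prints 0 for the "%-6s" format and 6 for "%-12s": a six-character keyword left-justified in a six-column field gets no padding at all. Under the old parser a GenBank round trip still showed six spaces only because they had been stored in the annotation value itself.
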
> >
> > > fixed. Now, why doesn't it work for you? Could you check that you do
> > > not have multiple copies of bioperl on your computer, with the older
> > > one accidentally getting executed?
> > >
> > > Sorry, I cannot come up with any better explanation,
> > >
> > >         -Heikki
> > >
> > > On Tuesday 16 Dec 2003 4:38 am, Wes Barris wrote:
> > >  > Hi,
> > >  >
> > >  > I have just succeeded in tracking down a bug that prevents genbank
> > >  > files written from bioperl from being properly imported into
> > >  > StackPack (clustering software).  The problem is due to a subtle
> > >  > difference in a genbank entry downloaded from NCBI and a genbank
> > >  > entry produced using genbank.pm.  If you use "od -c" to look at a
> > >  > genbank record from NCBI, you will notice that the word "ORIGIN" is
> > >  > followed by six space characters.
> > >  >
> > >  > ORIGIN
> > >  >         1 cggccgcgtc gacttttttt ttaggtattt ttctcttatt atttctaaaa tataaatttt
> > >  >        61 ggacattcaa aagtgcaaca ngttaatgtg cctgtgggga atatcacagt taaaaaaata
> > >  >
> > >  > If I process this file using bioperl and then write out a new
> > >  > genbank format file, the word "ORIGIN" is followed immediately by a
> > >  > carriage return (newline) character.
> > >  >
> > >  > It seems silly to me that spaces should be required after the word
> > >  > "ORIGIN", but they do exist in files downloaded from NCBI and
> > >  > StackPack seems to require these space characters in order to import
> > >  > a genbank file.
> > >
> > >  > Is there an official specification for the genbank format?  I have
> > >  > sent a bug report to the makers of StackPack too.
> > >  >
> > >  > In the meantime, I have modified my installed copy of
> > >  > Bio/SeqIO/genbank.pm, changing this line:
> > >  >
> > >  >          $self->_print(sprintf("%-6s%s\n",'ORIGIN',$o ? $o->value : ''));
> > >  >
> > >  > to this:
> > >  >
> > >  >          $self->_print(sprintf("%-12s%s\n",'ORIGIN      ',$o ? $o->value : ''));
> > >
> > > --
> > > ______ _/      _/_____________________________________________________
> > >       _/      _/                      http://www.ebi.ac.uk/mutations/
> > >      _/  _/  _/  Heikki Lehvaslaiho    heikki_at_ebi ac uk
> > >     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
> > >    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
> > >   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
> > >      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> > > ___ _/_/_/_/_/________________________________________________________
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
