[Bioperl-l] Bug in SeqIO genbank output
Jason Stajich
jason at cgt.duhs.duke.edu
Thu Jan 1 20:16:17 EST 2004
The reason Heikki was not seeing the problem is probably that he was
round-tripping a genbank file. If you start with embl or fasta instead,
you see that the trailing 6 spaces aren't coming in. This is because of
this code (parsing of genbank):
! if(defined($_) && s/^ORIGIN//) {
      chomp;
      if( $annotation && length($_) > 0 ) {
          $annotation->add_Annotation('origin',
              Bio::Annotation::SimpleValue->new(-value => $_));
      }
I am changing this to
! if(defined($_) && s/^ORIGIN\s+//) {
So the $o value in the ORIGIN writer was getting set with the 6 spaces
when inputting the genbank file. This is a silly thing to store as an
annotation.
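
A minimal sketch of the difference between the two substitutions, run outside of bioperl. The helper name origin_value and the sample line are assumptions made for illustration; only the two patterns come from genbank.pm.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): applies one of the two
# substitutions to an ORIGIN line and returns what would be left over
# to store as the 'origin' annotation value.
sub origin_value {
    my ($line, $pattern) = @_;
    my $text = $line;
    return undef unless $text =~ s/$pattern//;
    chomp $text;
    return $text;
}

# Assumed sample input: ORIGIN followed by six spaces and a newline,
# as described for records downloaded from NCBI.
my $ncbi_line = "ORIGIN" . (" " x 6) . "\n";

my $old = origin_value($ncbi_line, qr/^ORIGIN/);     # pattern before the fix
my $new = origin_value($ncbi_line, qr/^ORIGIN\s+/);  # pattern after the fix

printf "old pattern leaves %d character(s)\n", length $old;
printf "new pattern leaves %d character(s)\n", length $new;
```

With the old pattern the six spaces survive and, being non-empty, get stored as the annotation; with the new pattern nothing is left, so no annotation is added for a bare ORIGIN line.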
So fixing the ORIGIN problem
Index: Bio/SeqIO/genbank.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/Bio/SeqIO/genbank.pm,v
retrieving revision 1.99
diff -r1.99 genbank.pm
543c543
< if(defined($_) && s/^ORIGIN//) {
---
> if(defined($_) && s/^ORIGIN\s+//) {
819c819,820
< $self->_print(sprintf("%-6s%s\n",'ORIGIN',$o ? $o->value : ''));
---
> $self->_print(sprintf("%-12s%s\n",
> 'ORIGIN', $o ? $o->value : ''));
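
The effect of the width change in the writer can be checked in isolation. This is only a sketch: origin_line is a hypothetical stand-in for the _print call, and $value plays the role of `$o ? $o->value : ''`.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the writer line in genbank.pm: formats the
# ORIGIN keyword left-justified in a field of the given width, followed
# by the (possibly empty) origin value.
sub origin_line {
    my ($width, $value) = @_;
    return sprintf("%-${width}s%s\n", 'ORIGIN', $value);
}

my $old_line = origin_line(6,  '');  # "%-6s": ORIGIN is already 6 characters, no padding
my $new_line = origin_line(12, '');  # "%-12s": ORIGIN plus six trailing spaces

for my $line ($old_line, $new_line) {
    (my $shown = $line) =~ s/\n\z//;
    printf "[%s] %d characters before the newline\n", $shown, length $shown;
}
```

Because the parser no longer stores the padding as the annotation value, a round-tripped genbank record should not end up with the spaces twice.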
I also exposed an embl parsing bug that occurs when there is no feature
table. I fixed it on the main trunk and also merged it onto the branch.
Happy New Year.
--jason
On Fri, 2 Jan 2004, Wes Barris wrote:
> Heikki Lehvaslaiho wrote:
>
> > Wes,
> >
> > You did not say which version of bioperl you are using. For some reason
>
> I am using bioperl-1.2.3
>
> > which I
> > can not quite understand, the current code:
> > $self->_print(sprintf("%-6s%s\n",'ORIGIN',$o ? $o->value : ''));
> >
> > does print out the required six spaces after the word ORIGIN. This was
> > recently
>
> Really? How? In the above line "%-6s" left justifies 'ORIGIN' (which is
> already 6 characters). The '6' needs to be changed to '12' to get six
> extra spaces. See below.
>
>
> > fixed. Now, why doesn't it work for you? Could you check that you do not
> > have multiple copies of bioperl on your computer and the older one gets
> > accidentally executed?
> >
> > Sorry, I can not come up with any better explanation,
> >
> > -Heikki
> >
> > On Tuesday 16 Dec 2003 4:38 am, Wes Barris wrote:
> > > Hi,
> > >
> > > I have just succeeded in tracking down a bug that prevents genbank files
> > > written from bioperl from being properly imported into StackPack
> > > (clustering software). The problem is due to a subtle difference in
> > > a genbank entry downloaded from NCBI and a genbank entry produced using
> > > genbank.pm. If you use "od -c" to look at a genbank record from NCBI,
> > > you will notice that the word "ORIGIN" is followed by six space
> > > characters.
> > >
> > > ORIGIN
> > >         1 cggccgcgtc gacttttttt ttaggtattt ttctcttatt atttctaaaa tataaatttt
> > >        61 ggacattcaa aagtgcaaca ngttaatgtg cctgtgggga atatcacagt taaaaaaata
> > >
> > > If I process this file using bioperl and then write out a new genbank
> > > format file, the word "ORIGIN" is followed immediately by a carriage
> > > return (newline) character.
> > >
> > > It seems silly to me that spaces should be required after the word
> > > "ORIGIN", but they do exist in files downloaded from NCBI and StackPack
> > > seems to require these space characters in order to import a genbank
> > > file. Is there an official specification for the genbank format? I have
> > > sent a bug report to the makers of StackPack too.
> > >
> > > In the meantime, I have modified my installed copy of
> > > Bio/SeqIO/genbank.pm changing this line:
> > >
> > > $self->_print(sprintf("%-6s%s\n",'ORIGIN',$o ? $o->value : ''));
> > >
> > > to this:
> > >
> > > $self->_print(sprintf("%-12s%s\n",'ORIGIN ',$o ? $o->value : ''));
> >
> > --
> > ______ _/ _/_____________________________________________________
> > _/ _/ http://www.ebi.ac.uk/mutations/
> > _/ _/ _/ Heikki Lehvaslaiho heikki_at_ebi ac uk
> > _/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
> > _/ _/ _/ Wellcome Trust Genome Campus, Hinxton
> > _/ _/ _/ Cambs. CB10 1SD, United Kingdom
> > _/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
> > ___ _/_/_/_/_/________________________________________________________
> >
>
>
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu