[Bioperl-l] EMBL file with space before quoted, multi-line qualifier value

Adam Sjøgren adsj at novozymes.com
Thu Apr 9 12:47:47 UTC 2015


  Hi.

I just stumbled over an EMBL file that has a space after the equals sign
that follows the qualifier name.

BioPerl fails to parse folded (multi-line) qualifier values when that
happens, because it expects the "-char to be the first character after
the '='.

I have whittled it down to this example:

ID   TEST standard; DNA; 10 BP.
XX
AC   TEST;
XX
DT   09-APR-2015
XX
DE   Test of space before quoted qualifier.
XX
XX
FH   Key             Location/Qualifiers
FT   CDS             1..10
FT                   /*tag= x
FT                   /gene= "someT"
FT                   /product= "somewordandt extthatisquite lon
FT                   gandthereforewraps"
XX
SQ   Sequence 10 BP; 10 A; 0 C; 0 G; 0 T; 0 U; 0 Other;
     aaaaaaaaaa       10
//

And this "one"-liner:

  ~$ perl -e 'use warnings; use strict; use Bio::SeqIO; my $in=Bio::SeqIO->new("-file"=>"white_space.embl", "-format"=>"embl"); my $seq=$in->next_seq; foreach my $feature ($seq->all_SeqFeatures) { print $feature->primary_tag . "\n"; map { print "  /" . $_ . "=" . (join " ", $feature->get_tag_values($_)) . "\n"; } $feature->get_all_tags }'

  ------------- EXCEPTION: Bio::Root::Exception -------------
  MSG: Can't see new qualifier in: gandthereforewraps"
  from:
  /*tag= x
  /gene= "someT"
  /product= "somewordandt extthatisquite lon
  gandthereforewraps"

  STACK: Error::throw
  STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
  STACK: Bio::SeqIO::embl::_read_FTHelper_EMBL /usr/share/perl5/Bio/SeqIO/embl.pm:1361
  STACK: Bio::SeqIO::embl::next_seq /usr/share/perl5/Bio/SeqIO/embl.pm:400
  STACK: -e:1
  -----------------------------------------------------------

shows the error.

If I apply this patch:

--- Bio/SeqIO/embl.pm.orig	2015-04-09 14:27:08.035573910 +0200
+++ Bio/SeqIO/embl.pm	2015-04-09 14:27:46.952373300 +0200
@@ -1358,7 +1358,7 @@
     # intact to provide informative error messages.)
   QUAL: for (my $i = 0; $i < @qual; $i++) {
         $_ = $qual[$i];
-        my( $qualifier, $value ) = m{^/([^=]+)(?:=(.+))?}
+        my( $qualifier, $value ) = m{^/([^=]+)(?:=\s*(.+))?}
             or $self->throw("Can't see new qualifier in: $_\nfrom:\n"
                             . join('', map "$_\n", @qual));
         if (defined $value) {

then the output is:

  CDS
    /*tag=x
    /gene=someT
    /product=somewordandt extthatisquite lon gandthereforewraps

which seems more reasonable, even if the format does not allow
whitespace after the =-sign (I haven't checked).
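
For anyone who wants to see the difference without touching an installed
BioPerl, here is a minimal standalone sketch that runs both patterns on
one qualifier line from the example. The "old" pattern is the one on the
removed line of the patch; the "new" one only adds \s* after the '='.
The helper split_qualifier is hypothetical (it is not part of BioPerl),
and the /^"/ test is just my stand-in for the parser's "is the value
quoted?" check, so treat this as an illustration rather than the real
code path:

```perl
use strict;
use warnings;

# Pattern from the removed ("-") line of the patch.
my $old_re = qr{^/([^=]+)(?:=(.+))?};
# Same pattern, but allowing optional whitespace after the "=".
my $new_re = qr{^/([^=]+)(?:=\s*(.+))?};

# Hypothetical helper (not a BioPerl function): split one qualifier
# line into (qualifier, value) using the given pattern.
sub split_qualifier {
    my ($re, $line) = @_;
    my ($qualifier, $value) = $line =~ $re;
    return ($qualifier, $value);
}

# First line of the folded /product qualifier from the test file.
my $line = '/product= "somewordandt extthatisquite lon';

for my $case (['old', $old_re], ['new', $new_re]) {
    my ($label, $re) = @$case;
    my (undef, $value) = split_qualifier($re, $line);
    # Assumption: a value counts as quoted only if it starts with a
    # double quote, which is what the error above suggests.
    my $quoted = $value =~ /^"/ ? 'yes' : 'no';
    print "$label: value=[$value] starts with quote: $quoted\n";
}
```

With the old pattern the captured value begins with the space, so it is
not recognised as the start of a quoted string and the continuation line
is then taken for a new qualifier. With the new pattern the value begins
with the quote.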

What do you think?

  Best regards,

    Adam

-- 
                                                          Adam Sjøgren
                                                    adsj at novozymes.com


