[Bioperl-l] EMBL file with space before quoted, multi-line qualifier value
Adam Sjøgren
adsj at novozymes.com
Thu Apr 9 12:47:47 UTC 2015
Hi.
I just stumbled over an EMBL file that has a space after the equals sign
that follows the qualifier name.
BioPerl fails to parse folded (multi-line) qualifier values when that
happens, because the "-char is expected to be the first character after
the '='.
I have whittled it down to this example:
ID TEST standard; DNA; 10 BP.
XX
AC TEST;
XX
DT 09-APR-2015
XX
DE Test of space before quoted qualifier.
XX
XX
FH Key Location/Qualifiers
FT CDS 1..10
FT /*tag= x
FT /gene= "someT"
FT /product= "somewordandt extthatisquite lon
FT gandthereforewraps"
XX
SQ Sequence 10 BP; 10 A; 0 C; 0 G; 0 T; 0 U; 0 Other;
aaaaaaaaaa 10
//
And this "one"-liner:
~$ perl -e 'use warnings; use strict; use Bio::SeqIO; my $in=Bio::SeqIO->new("-file"=>"white_space.embl", "-format"=>"embl"); my $seq=$in->next_seq; foreach my $feature ($seq->all_SeqFeatures) { print $feature->primary_tag . "\n"; map { print " /" . $_ . "=" . (join " ", $feature->get_tag_values($_)) . "\n"; } $feature->get_all_tags }'
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Can't see new qualifier in: gandthereforewraps"
from:
/*tag= x
/gene= "someT"
/product= "somewordandt extthatisquite lon
gandthereforewraps"
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::embl::_read_FTHelper_EMBL /usr/share/perl5/Bio/SeqIO/embl.pm:1361
STACK: Bio::SeqIO::embl::next_seq /usr/share/perl5/Bio/SeqIO/embl.pm:400
STACK: -e:1
-----------------------------------------------------------
shows the error.
If I add this patch:
--- Bio/SeqIO/embl.pm.orig 2015-04-09 14:27:08.035573910 +0200
+++ Bio/SeqIO/embl.pm 2015-04-09 14:27:46.952373300 +0200
@@ -1358,7 +1358,7 @@
# intact to provide informative error messages.)
QUAL: for (my $i = 0; $i < @qual; $i++) {
$_ = $qual[$i];
- my( $qualifier, $value ) = m{^/([^=]+)(?:=(.+))?}
+ my( $qualifier, $value ) = m{^/([^=]+)(?:=\s*(.+))?}
or $self->throw("Can't see new qualifier in: $_\nfrom:\n"
. join('', map "$_\n", @qual));
if (defined $value) {
then the output is:
CDS
/*tag=x
/gene=someT
/product=somewordandt extthatisquite lon gandthereforewraps
which seems more reasonable, even if the format does not allow
whitespace after the =-sign (I haven't checked).
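To make the difference concrete, here is a minimal standalone sketch, not BioPerl code. It compares the unpatched pattern with the same pattern plus `\s*` after the `=`, applied to the first /product line from the example. The helper names qualifier_value and starts_quoted are made up for illustration, and the "starts with a double quote" check is only my assumption of how the parser decides whether a value continues on the next line.

```perl
use strict;
use warnings;

# Pattern as in the unpatched line of the diff.
my $old = qr{^/([^=]+)(?:=(.+))?};
# Same pattern, but skipping optional whitespace after the '='.
my $new = qr{^/([^=]+)(?:=\s*(.+))?};

# Hypothetical helper (not part of BioPerl): return the captured value,
# or undef if the line has no "=value" part.
sub qualifier_value {
    my ($re, $line) = @_;
    my (undef, $value) = $line =~ $re;
    return $value;
}

# Hypothetical helper: assumed stand-in for the parser's check that a
# value is quoted and may therefore span several lines.
sub starts_quoted {
    my ($value) = @_;
    return (defined($value) && substr($value, 0, 1) eq '"') ? 1 : 0;
}

my $line = '/product= "somewordandt extthatisquite lon';

for my $case ([old => $old], [new => $new]) {
    my ($name, $re) = @$case;
    my $value = qualifier_value($re, $line);
    printf "%s: value=[%s] starts_quoted=%d\n",
        $name, $value, starts_quoted($value);
}
```

With the old pattern the captured value begins with the space, so the quote check fails and the continuation line is then treated as a new qualifier. With the extra `\s*` the value begins with the quote.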
What do you think?
Best regards,
Adam
--
Adam Sjøgren
adsj at novozymes.com