[Bioperl-l] parsing of keywords field in Bio::SeqIO::genbank

Geoff Purdy geoff_purdy at yahoo.com
Tue Feb 11 06:32:31 EST 2003


We've been noticing some odd behavior with the parsing
of the 'keywords' field from a genbank flatfile in
Bio::SeqIO::genbank.  I was unable to find any
discussion of this in the docs or the bioperl-l list
archives so I was hoping someone on this list could
shed some light on the subject.

From reading the genbank release notes regarding the
KEYWORDS field in the genbank file format and reading
the source code which parses the KEYWORDS field in
Bio::SeqIO::genbank, it appears that bioperl is
discarding information in this field.  The
specification allows for both single keywords and for
phrases delimited by semicolons.  However, it appears
that the parser discards the semicolons and treats the
entire field as a single phrase.  This can cause
problems searching this field in downstream
applications.

Is information being discarded, or am I
misunderstanding something?  Thanks.


Excerpt from GenBank Release Notes
(ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt):

"3.4.8 KEYWORDS Format The KEYWORDS field does not
appear in unannotated entries, but is required in all
annotated entries. Keywords are separated by
semicolons; a "keyword" may be a single word or a
phrase consisting of several words. Each line in the
keywords field ends in a semicolon; the last line ends
with a period. If no keywords are included in the
entry, the KEYWORDS record contains only a period."
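
As a rough illustration of the rules quoted above, here is a
minimal sketch of splitting the text of a KEYWORDS field into
separate keywords. The helper name split_keywords and the sample
strings are made up for this example; this is not BioPerl code,
and it assumes the "KEYWORDS" tag itself has already been removed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): turn the text of a
# KEYWORDS field into a list of keywords, one entry per
# semicolon-delimited word or phrase.
sub split_keywords {
    my ($field) = @_;
    $field =~ s/\s+/ /g;    # join continuation lines into one line
    $field =~ s/^ //;       # trim leading space
    $field =~ s/ $//;       # trim trailing space
    $field =~ s/\.$//;      # drop the period that ends the field
    my @keywords;
    for my $kw ( split /;/, $field ) {
        $kw =~ s/^\s+//;
        $kw =~ s/\s+$//;
        push @keywords, $kw if length $kw;
    }
    return @keywords;
}

# Made-up sample field spanning two lines.
my @kw = split_keywords("kinase; signal transduction;\n            zinc finger.");
print join( ' | ', @kw ), "\n";    # kinase | signal transduction | zinc finger
```

A field containing only a period yields an empty list, which
matches the "no keywords" case described in the release notes.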


Excerpt from BioPerl docs
(http://doc.bioperl.org/releases/bioperl-1.2/Bio/SeqIO/genbank.html):

#Keywords
elsif( /^KEYWORDS\s+(.*)/ ) { 
   my $keywords = $1; 
   $keywords =~ s/\;//g; 
   $keywords =~ s/\.$//; # remove possibly trailing dot
   $params{'-keywords'} = $keywords; 
} 
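
To make the concern concrete, the sketch below applies the same
two substitutions as the excerpt to a made-up KEYWORDS line. The
sub name flatten_like_excerpt and the sample line are
hypothetical and only serve to show the effect; the rest of the
real parser is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitutions from the excerpt.
sub flatten_like_excerpt {
    my ($line) = @_;
    return unless $line =~ /^KEYWORDS\s+(.*)/;
    my $keywords = $1;
    $keywords =~ s/\;//g;    # semicolons removed, as in the excerpt
    $keywords =~ s/\.$//;    # trailing dot removed, as in the excerpt
    return $keywords;
}

# Made-up sample line with three keywords, two of them phrases.
my $line = 'KEYWORDS    kinase; signal transduction; zinc finger.';
print flatten_like_excerpt($line), "\n";
# prints: kinase signal transduction zinc finger
```

Once the semicolons are gone, nothing in the resulting string
shows where "signal transduction" ends and "zinc finger" begins,
so a downstream application cannot recover the original phrases.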


