[Bioperl-l] Bio::SeqIO Genbank parsing bug?
CHALFANT_CHRIS_M@Lilly.com
Mon, 15 Jul 2002 09:43:51 -0500
While parsing the Genbank record for GI:1710638, I discovered that
Bio::SeqIO was dropping the VERSION line. Here is the VERSION line for
this record:
VERSION P51449 GI:1710638
Here is the regex that parses the VERSION line:
# Version number
if( /^VERSION\s+(\S+)\.(\d+)\s*(GI:\d+)?/ ) {
    $seq->seq_version($2);
    $seq->primary_id(substr($3, 3)) if($3);
}
It appears that this regex requires that the accession number in the
VERSION line have a "dot-version" extension. This requirement causes the
parser to miss the VERSION lines in records without "dot-version"
extensions in the accession and leaves $seq->accession undefined.
I verified this behavior by changing a local copy of the record for
1710638 to read:
VERSION P51449.1 GI:1710638
I then parsed the altered copy with Bio::SeqIO. The VERSION line was
parsed correctly this time.
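The experiment above can be reproduced without Bio::SeqIO by running the quoted pattern against both forms of the line. This is only a minimal sketch: the helper name version_line_matches is made up for illustration, and the two test lines are the ones shown in this message.

```perl
use strict;
use warnings;

# The VERSION pattern quoted above, unchanged.
my $version_re = qr/^VERSION\s+(\S+)\.(\d+)\s*(GI:\d+)?/;

# Hypothetical helper: returns 1 if the pattern matches the line, 0 otherwise.
sub version_line_matches {
    my ($line) = @_;
    return $line =~ $version_re ? 1 : 0;
}

for my $line ('VERSION P51449 GI:1710638', 'VERSION P51449.1 GI:1710638') {
    my $result = version_line_matches($line) ? 'matched' : 'not matched';
    print "$line => $result\n";
}
```

The line without the dot-version is reported as not matched, because the pattern demands a literal "." followed by digits after the accession.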
Should the regex be changed to accept records whose accession does not
have a "dot-version" extension?
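One possible relaxation, offered as a sketch rather than as the actual Bio::SeqIO fix, is to make the dot-version an optional group. The function parse_version_line and the hash keys it returns are hypothetical names used only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO: parse a VERSION line whose
# accession may or may not carry a ".N" suffix.
sub parse_version_line {
    my ($line) = @_;
    # [^\s.]+ stops the accession at the first dot or whitespace;
    # (?:\.(\d+))? makes the dot-version optional;
    # (?:GI:(\d+))? captures the GI number without the "GI:" prefix.
    return unless $line =~ /^VERSION\s+([^\s.]+)(?:\.(\d+))?\s*(?:GI:(\d+))?/;
    return {
        accession => $1,
        version   => $2,    # undef when no dot-version is present
        gi        => $3,    # undef when no GI is present
    };
}

for my $line ('VERSION P51449 GI:1710638', 'VERSION P51449.1 GI:1710638') {
    my $rec = parse_version_line($line) or next;
    my $version = defined $rec->{version} ? $rec->{version} : '(none)';
    my $gi      = defined $rec->{gi}      ? $rec->{gi}      : '(none)';
    print "accession=$rec->{accession} version=$version gi=$gi\n";
}
```

With a pattern along these lines, callers would need to check whether the version is defined before calling $seq->seq_version, since records such as P51449 carry no version at all.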
Chris
Chris Chalfant
Bioinformatics
Eli Lilly and Company
317-433-3407