[Bioperl-l] bug in genbank.pm
Wang, Kai
Wang.Kai@mayo.edu
Sat, 16 Feb 2002 17:30:05 -0600
I pointed out this problem about two months ago, but nobody changed it. The
new GenBank file format adds a "molecular shape" field (linear or circular) to
the LOCUS line, so the current genbank.pm cannot parse it correctly.
In the file:
# $Id: genbank.pm,v 1.46 2002/02/14 16:41:22 jason Exp $
    if (($2 eq 'bp') || defined($5)) {
        if ($4 eq 'circular') {
            $seq->molecule($3);
            $seq->is_circular($4);
            $seq->division($5);
            ($date) = $line =~ /.*(\d\d-\w\w\w-\d\d\d\d)/;
        } else {
            $seq->molecule($3);
            $seq->division($4);
            $date = $5;
        }
    } else {
        $seq->molecule('PRT') if($2 eq 'aa');
        $seq->division($3);
        $date = $4;
    }
The above code is based on the wrong assumption that NCBI will never add a
'linear' tag to a record.
One example is accession number 'NM_003748'. The first line is:
LOCUS NM_003748 3134 bp mRNA linear PRI 01-NOV-2000
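The current genbank.pm cannot recognize the date 01-NOV-2000: since the field
after the molecule type is 'linear' rather than 'circular', the else branch
apparently stores 'linear' as the division and takes $5 ('PRI') as the date.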
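
The failure can be reproduced with a small stand-alone sketch of the branch logic quoted above. The regex that fills $3, $4 and $5 is not shown in this message, so the sketch assumes those captures are simply the three whitespace-separated fields following 'bp'; the plain variables ($molecule, $division, $date) are hypothetical stand-ins for the $seq accessors.

```perl
use strict;
use warnings;

my $line = 'LOCUS NM_003748 3134 bp mRNA linear PRI 01-NOV-2000';

# Assumption: the three fields after 'bp' play the role of $3, $4, $5.
my @fields = split ' ', $line;
my ($c3, $c4, $c5) = @fields[4 .. 6];

my ($molecule, $division, $date);
if ($c4 eq 'circular') {
    # Only this branch looks for the date with a dedicated pattern.
    $molecule = $c3;
    $division = $c5;
    ($date) = $line =~ /.*(\d\d-\w\w\w-\d\d\d\d)/;
} else {
    # 'linear' ends up here and shifts every following field by one.
    $molecule = $c3;
    $division = $c4;
    $date     = $c5;
}

print "molecule=$molecule division=$division date=$date\n";
# prints: molecule=mRNA division=linear date=PRI
```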
I think the best way is to use:
$line =~ /^LOCUS\s+(\S+)\s+\S+\s+(bp|aa)\s+(\S+)?\s+(\S+)?\s+(\w\w\w)?\s+(\d\d-\w\w\w-\d\d\d\d)?/
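
As a quick check, the suggested pattern can be run against the NM_003748 example line. This is only a sketch: the variable names are illustrative rather than taken from genbank.pm, the column spacing of the sample line is assumed, and the pattern has been tried here on this one line only. Older records without the linear/circular field may populate the optional groups differently and would need separate testing.

```perl
use strict;
use warnings;

# Sample LOCUS line; the exact column spacing is an assumption.
my $line = 'LOCUS       NM_003748               3134 bp    mRNA    linear   PRI 01-NOV-2000';

my ($name, $unit, $molecule, $shape, $division, $date);
if ($line =~ /^LOCUS\s+(\S+)\s+\S+\s+(bp|aa)\s+(\S+)?\s+(\S+)?\s+(\w\w\w)?\s+(\d\d-\w\w\w-\d\d\d\d)?/) {
    # Copy the captures immediately, before another match overwrites them.
    ($name, $unit, $molecule, $shape, $division, $date) = ($1, $2, $3, $4, $5, $6);
}

# Optional groups may be undef on other lines, so default to ''.
printf "%-9s %s\n", $_->[0], $_->[1] // '' for (
    [ name     => $name ],
    [ unit     => $unit ],
    [ molecule => $molecule ],
    [ shape    => $shape ],
    [ division => $division ],
    [ date     => $date ],
);
```

On this line the date lands in $6 and the division in $5, so the calling code would need to read those captures instead of $4 and $5.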