[Bioperl-l] IMGT parsing
Alex Brown
abrown at tech.mrc.ac.uk
Fri Feb 18 06:15:32 EST 2005
Hi.
I had a small problem using Bio::SeqIO (in BioPerl 1.4) to parse the IMGT
flat file database. Although IMGT uses an EMBL-like format, Bio::SeqIO
was unable to extract display_id(), which is a nuisance when converting
between formats. The cause is a difference between the ID lines of the
EMBL and IMGT formats:
EMBL -
ID TRBG361 standard; mRNA; PLN; 1859 BP.
IMGT -
ID MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.
The following modification to embl.pm seems to allow correct parsing
of both formats.
Change the lines:
$line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
$line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
$name = $1;
$mol = $2;
$div = $3;
if(! $name) {
$name = "unknown id";
}
to:
$line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
# this is the new line to replace the above, allowing IMGT records to be read as well
($name, $mol, $div) = ($line =~
/^ID\s*(\S*).*;\s*(\S*);\s*(\S*);/);
if(! $name) {
$name = "unknown id";
}
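
To check the change outside of BioPerl, the two patterns can be run against
the two ID lines quoted above. This is only a rough sketch: parse_old and
parse_new are hypothetical helper names of my own, not functions in embl.pm,
and the ID lines are copied as they appear in this message.

```perl
use strict;
use warnings;

# The two ID lines quoted earlier in this message.
my %id_lines = (
    EMBL => 'ID TRBG361 standard; mRNA; PLN; 1859 BP.',
    IMGT => 'ID MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.',
);

# Hypothetical helper: the original pattern from embl.pm.
# Returns an empty list when the line does not match.
sub parse_old {
    my ($line) = @_;
    return $line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
}

# Hypothetical helper: the replacement pattern proposed above.
sub parse_new {
    my ($line) = @_;
    return $line =~ /^ID\s*(\S*).*;\s*(\S*);\s*(\S*);/;
}

for my $format (sort keys %id_lines) {
    my $line = $id_lines{$format};
    my @old  = parse_old($line);
    my @new  = parse_new($line);

    # An empty list means the pattern failed, which is the case that
    # leaves $name as "unknown id" in embl.pm.
    print "$format old: ", (@old ? join(' | ', @old) : 'no match'), "\n";
    print "$format new: ", (@new ? join(' | ', @new) : 'no match'), "\n";
}
```

With these two lines, the old pattern matches only the EMBL record, while the
new one returns name, molecule and division for both.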
Hope this is useful.
Alex Brown.
PS. BACK UP embl.pm before changing it.