[Bioperl-l] Parsing EMBL DR lines with 1 accession
James Abbott
j.abbott at imperial.ac.uk
Tue Mar 16 09:28:30 EST 2004
Greetings bioperlers...
I have been using Bio::SeqIO to parse EMBL files, and noticed that some
of the database cross-references (DR lines) were missing from the
returned RichSeq object. The missing references were to the GOA
database. These have only a primary id; the secondary id/accession
usually found in (for example) swissprot/trembl references is absent,
e.g. (from EMBL:AE000562)
DR GOA; O25226.
DR GOA; P96551.
DR SPTREMBL; O25217; O25217.
DR SPTREMBL; O25218; O25218.
The SPTREMBL cross-references are parsed fine; however, the GOA
references are skipped. Looking at the code in question in
Bio::SeqIO::embl, although there is provision for dbxrefs with a single
id, the regex requires a trailing ';' after the primary accession. I
have included a diff against embl.pm v 1.72 (see below) which alters
this regex to optionally match the second ';', allowing the id in DR
lines with a single accession to be parsed as the primary accession.
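
To make the difference between the two patterns concrete, here is a minimal standalone sketch that runs both against the sample DR lines. It uses only core Perl, not Bio::SeqIO itself. The helper name parse_dr is hypothetical, and the use of \s+ after "DR" is an assumption so that any run of spaces after the line code is accepted.

```perl
use strict;
use warnings;

# Pattern as in embl.pm v1.72: the ';' after the primary id is mandatory.
# (\s+ after "DR" is an assumption, to tolerate any run of spaces.)
my $strict  = qr/^DR\s+([^\s;]+);\s*([^\s;]+);\s*([^\s;]+)?\.$/;

# Patched pattern: the second ';' is optional.
my $relaxed = qr/^DR\s+([^\s;]+);\s*([^\s;]+);?\s*([^\s;]+)?\.$/;

# Hypothetical helper: returns (database, primary id, secondary id),
# or an empty list if the line does not match.
sub parse_dr {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields;
}

for my $line ('DR   GOA; O25226.', 'DR   SPTREMBL; O25217; O25217.') {
    for my $case (['v1.72', $strict], ['patched', $relaxed]) {
        my ($label, $re) = @$case;
        my ($db, $prim, $sec) = parse_dr($re, $line);
        if (defined $db) {
            $sec = '(none)' unless defined $sec;
            print "$label: '$line' -> db=$db primary=$prim secondary=$sec\n";
        }
        else {
            print "$label: '$line' -> no match\n";
        }
    }
}
```

With these inputs the v1.72 pattern does not match the GOA line at all, while the patched pattern captures O25226 as the primary id and leaves the secondary id undefined. Both patterns treat the SPTREMBL line identically.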
Writing these entries back out again using Bio::SeqIO::embl results in
these DR lines appearing as:
DR GOA; O25226; .
However, the examples given in the EMBL user manual, and all those I've
found in EMBL, lack this second ';' and the whitespace following it.
The second modification in the diff changes the behaviour of
Bio::SeqIO::embl when there is no secondary accession present, ensuring
that the DR line is written out as
DR GOA; O25226.
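
The write-side change can be sketched in isolation as follows. The helper name format_dr_body is hypothetical. It mirrors the conditional from the diff but builds only the text after the "DR" line code, and leaves out the line wrapping done by _write_line_EMBL_regex.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the patched line-building logic:
# emit the secondary id (and its separating ';') only when one is present.
sub format_dr_body {
    my ($db_name, $prim, $opt) = @_;
    $opt = $opt || '';    # same defaulting as in the diff
    return $opt ? "$db_name; $prim; $opt."
                : "$db_name; $prim.";
}

print "DR   ", format_dr_body('GOA', 'O25226'), "\n";
print "DR   ", format_dr_body('SPTREMBL', 'O25217', 'O25217'), "\n";
```

Because the test is on the truthiness of $opt, as in the diff, a secondary id of "0" would be treated as absent. A length check would be a stricter alternative if that case ever matters.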
Cheers,
James
--
Dr. James Abbott <j.abbott at imperial.ac.uk>
Bioinformatics Software Developer, Bioinformatics Support Service
Imperial College, London
*** embl.pm Tue Mar 16 13:52:35 2004
--- embl1.72.pm Tue Mar 16 13:00:00 2004
***************
*** 594,600 ****
my $prim = $dr->primary_id;
my $opt = $dr->optional_id || '';
! my $line = $opt ? "$db_name; $prim; $opt." :
"$db_name; $prim.";
$self->_write_line_EMBL_regex("DR ", "DR ",
$line, '\s+|$', 80); #'
}
$self->_print("XX\n");
--- 594,600 ----
my $prim = $dr->primary_id;
my $opt = $dr->optional_id || '';
! my $line = "$db_name; $prim; $opt.";
$self->_write_line_EMBL_regex("DR ", "DR ",
$line, '\s+|$', 80); #'
}
$self->_print("XX\n");
***************
*** 919,925 ****
while (defined( $_ ||= $self->_readline )) {
if (my($databse, $prim_id, $sec_id)
! = /^DR ([^\s;]+);\s*([^\s;]+);?\s*([^\s;]+)?\.$/) {
my $link = Bio::Annotation::DBLink->new();
$link->database ( $databse );
$link->primary_id ( $prim_id );
--- 919,925 ----
while (defined( $_ ||= $self->_readline )) {
if (my($databse, $prim_id, $sec_id)
! = /^DR ([^\s;]+);\s*([^\s;]+);\s*([^\s;]+)?\.$/) {
my $link = Bio::Annotation::DBLink->new();
$link->database ( $databse );
$link->primary_id ( $prim_id );