[Bioperl-l] Parsing EMBL DR lines with 1 accession

James Abbott j.abbott at imperial.ac.uk
Tue Mar 16 09:28:30 EST 2004


Greetings bioperlers...

I have been using Bio::SeqIO to parse EMBL files, and noticed that some 
of the database cross-references (DR lines) were missing from the 
returned RichSeq object. The missing references are to the GOA 
database and carry only a primary id: the secondary id/accession 
usually found in (for example) Swiss-Prot/TrEMBL references is absent, 
e.g. (from EMBL:AE000562):

DR   GOA; O25226.
DR   GOA; P96551.
DR   SPTREMBL; O25217; O25217.
DR   SPTREMBL; O25218; O25218.

The SPTREMBL cross-references are parsed fine, but the GOA references 
are skipped. Looking at the code in question in Bio::SeqIO::embl, 
although there is provision for dbxrefs with a single id, the regex 
requires a trailing ';' after the primary accession. I have included a 
diff against embl.pm v 1.72 (see below...) which alters this regex to 
optionally match the second ';', allowing the id in DR lines with a 
single accession to be parsed as the primary accession.
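
As a rough illustration of the difference, here is a standalone sketch 
that runs both patterns from the diff against the example lines above. 
The parse_dr helper and the hash keys it returns are made up for this 
example and are not part of Bio::SeqIO::embl.

```perl
use strict;
use warnings;

# Patterns copied from the diff: the v1.72 one requires the second ';',
# the patched one makes it optional.
my $old_re = qr/^DR   ([^\s;]+);\s*([^\s;]+);\s*([^\s;]+)?\.$/;
my $new_re = qr/^DR   ([^\s;]+);\s*([^\s;]+);?\s*([^\s;]+)?\.$/;

# Hypothetical helper for illustration only: returns a hashref on a match,
# or nothing when the line is not recognised.
sub parse_dr {
    my ($re, $line) = @_;
    my ($db, $prim, $sec) = $line =~ $re or return;
    return { database => $db, primary_id => $prim, optional_id => $sec };
}

for my $line ('DR   GOA; O25226.', 'DR   SPTREMBL; O25217; O25217.') {
    for my $pair ([old => $old_re], [new => $new_re]) {
        my ($name, $re) = @$pair;
        my $hit = parse_dr($re, $line);
        printf "%-3s %-32s %s\n", $name, $line,
            $hit ? "primary=$hit->{primary_id} optional="
                   . ($hit->{optional_id} // '(none)')
                 : 'no match';
    }
}
```

With the old pattern the GOA line should be reported as "no match"; 
with the patched pattern it should yield O25226 as the primary id and 
no optional id. The SPTREMBL line matches under both.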

Writing these entries back out again using Bio::SeqIO::embl results in 
these DR lines appearing as:

DR   GOA; O25226; .

However, the examples given in the EMBL user manual, and all those I've 
found in EMBL, lack this second ';' and the whitespace that follows it. 
The second modification in the diff changes the behaviour of 
Bio::SeqIO::embl when there is no secondary accession present, ensuring 
that the DR line is written out as

DR   GOA; O25226.
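
The line-building logic can be tried in isolation with a small sketch 
like the one below. format_dr is a hypothetical stand-in for the 
relevant statement in the writer; the real code hands the string to 
_write_line_EMBL_regex rather than returning it with the "DR   " prefix 
attached.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched line-building step: the second
# ';' is emitted only when an optional (secondary) id is present.
sub format_dr {
    my ($db_name, $prim, $opt) = @_;
    $opt = '' unless defined $opt;
    return $opt ? "DR   $db_name; $prim; $opt."
                : "DR   $db_name; $prim.";
}

print format_dr('GOA', 'O25226'), "\n";                   # DR   GOA; O25226.
print format_dr('SPTREMBL', 'O25217', 'O25217'), "\n";    # DR   SPTREMBL; O25217; O25217.
```

Note that, like the patch, this tests $opt for truth, so an optional id 
of literally "0" would be treated as absent.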

Cheers,
James

-- 
Dr. James Abbott <j.abbott at imperial.ac.uk>
Bioinformatics Software Developer, Bioinformatics Support Service
Imperial College, London


*** embl.pm     Tue Mar 16 13:52:35 2004
--- embl1.72.pm Tue Mar 16 13:00:00 2004
***************
*** 594,600 ****
                     my $prim    = $dr->primary_id;
                     my $opt     = $dr->optional_id || '';

!                   my $line = $opt ? "$db_name; $prim; $opt." : "$db_name; $prim.";
                     $self->_write_line_EMBL_regex("DR   ", "DR   ", $line, '\s+|$', 80); #'
                 }
                 $self->_print("XX\n");
--- 594,600 ----
                     my $prim    = $dr->primary_id;
                     my $opt     = $dr->optional_id || '';

!                   my $line = "$db_name; $prim; $opt.";
                     $self->_write_line_EMBL_regex("DR   ", "DR   ", $line, '\s+|$', 80); #'
                 }
                 $self->_print("XX\n");
***************
*** 919,925 ****
       while (defined( $_ ||= $self->_readline )) {

           if (my($databse, $prim_id, $sec_id)
!                 = /^DR   ([^\s;]+);\s*([^\s;]+);?\s*([^\s;]+)?\.$/) {
               my $link = Bio::Annotation::DBLink->new();
               $link->database   ( $databse );
               $link->primary_id ( $prim_id );
--- 919,925 ----
       while (defined( $_ ||= $self->_readline )) {

           if (my($databse, $prim_id, $sec_id)
!                 = /^DR   ([^\s;]+);\s*([^\s;]+);\s*([^\s;]+)?\.$/) {
               my $link = Bio::Annotation::DBLink->new();
               $link->database   ( $databse );
               $link->primary_id ( $prim_id );





