[Bioperl-l] Length of ID in EMBL sequence entries

simon andrews (BI) simon.andrews at bbsrc.ac.uk
Thu Nov 4 06:11:36 EST 2004



> -----Original Message-----
> From: Daniel Lang [mailto:daniel.lang at biologie.uni-freiburg.de] 
> Sent: 04 November 2004 08:38
> To: Bioperl-List
> Subject: [Bioperl-l] Length of ID in EMBL sequence entries
> 
> 
> Hi,
> I just stumbled over my sequence IDs getting trimmed to 10 
> characters when writing with Bio::SeqIO::embl. line 453: 
> $temp_line = sprintf("%-11.10sstandard; $mol; $div; %d BP.", 
> $seq->id(), $len);

It's one of the fixes in:

http://bugzilla.bioperl.org/show_bug.cgi?id=1618

The problem was that the original format, if given an 11 character ID,
would let the ID run directly into the dataclass field on the ID line.
Files generated this way were not recognised by a number of analysis
programs.  A couple of previous posts on this list were caused by this
formatting issue.
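
As a rough illustration of the difference, here is a small standalone
sketch.  The values for $mol, $div and $len are made up, and the
"%-11s" format is only my assumption of what an untruncated version
would look like; it is not copied from the old module code.

```perl
use strict;
use warnings;

my ($mol, $div, $len) = ('DNA', 'UNC', 120);   # hypothetical values
my $id = 'ABCDEFGHIJK';                        # an 11 character ID

# Format quoted from the module: pad to 11 columns, truncate the ID to 10.
my $fixed = sprintf("%-11.10sstandard; $mol; $div; %d BP.", $id, $len);

# Assumed untruncated format: pad to 11 columns, no maximum width.
my $unfixed = sprintf("%-11sstandard; $mol; $div; %d BP.", $id, $len);

print "$fixed\n";    # ABCDEFGHIJ standard; DNA; UNC; 120 BP.
print "$unfixed\n";  # ABCDEFGHIJKstandard; DNA; UNC; 120 BP.
```

With the precision of 10 the last character of the ID is lost, but the
space before "standard" is always there.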

Looking back through the EMBL manual at:

http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html

I don't see any clear guidance about exact positions in the ID line.
All their examples show the dataclass (usually "standard") starting 11
characters from the beginning of the entryname, so the fix kept this
distance but limited the entryname to 10 chars to enforce a space
between fields.  The specification seems to include a space between
entryname and dataclass, but that would limit us to 10 char entrynames.

We could code this as:

$temp_line = sprintf("%-12.11sstandard; $mol; $div; %d BP.",
                     $seq->id(), $len);

This would still allow an 11 char entryname, but would move the rest of
the line along by one character.  That still looks like it conforms to
the specification, but might it break other things?
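
To make the shift concrete, this sketch compares where the dataclass
starts under the two formats.  The helper names id_line and
dataclass_offset, and the values for $mol, $div and $len, are
hypothetical and exist only for this example.

```perl
use strict;
use warnings;

my ($mol, $div, $len) = ('DNA', 'UNC', 120);   # hypothetical values

# Build an ID line; $field_fmt is the sprintf spec for the entryname field.
sub id_line {
    my ($field_fmt, $id) = @_;
    return sprintf("${field_fmt}standard; %s; %s; %d BP.",
                   $id, $mol, $div, $len);
}

# 0-based offset at which the dataclass "standard" begins.
sub dataclass_offset {
    my ($line) = @_;
    return index($line, 'standard');
}

for my $id ('AB123', 'ABCDEFGHIJK') {
    my $current  = dataclass_offset(id_line('%-11.10s', $id));
    my $proposed = dataclass_offset(id_line('%-12.11s', $id));
    # For both IDs: current offset is 11, proposed offset is 12.
    print "$id: current=$current proposed=$proposed\n";
}
```

A reader that splits the ID line on whitespace should not notice the
change, whereas one that expects the dataclass at a fixed column
presumably would.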

Does anyone have a definitive answer about the correct way to do this?

Simon.