[Bioperl-l] Length of ID in EMBL sequence entries
Daniel Lang
daniel.lang at biologie.uni-freiburg.de
Thu Nov 4 12:09:32 EST 2004
I'm wondering why the entryname length should be limited at all...
I just wrote an email to EBI support. They should be able to settle
this...
But I fear that the standard really is something like "%-11.10s". I've
checked parts of the latest EMBL distribution:
perl -e 'while (<>) {print length $1 ,"\n" if /^ID\s+(\S+\s+)\S+/ &&
length $1 > 11;}' cum_2.dat
--> none of the entries was above 11
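
A more readable version of the same check, as a minimal sketch. The helper name long_id_fields and the two sample lines are made up for illustration; the regex is the one from the one-liner, with an extra capture for the bare entryname.

```perl
use strict;
use warnings;

# Hypothetical helper: return [entryname, field_width] for every ID line
# whose entryname plus trailing padding is wider than $max columns.
sub long_id_fields {
    my ($max, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        # Same pattern as the one-liner: entryname, its padding, then the dataclass.
        next unless $line =~ /^ID\s+((\S+)\s+)\S+/;
        push @hits, [ $2, length($1) ] if length($1) > $max;
    }
    return @hits;
}

# Made-up sample lines, not taken from a real EMBL release.
my @sample = (
    'ID   AB000001   standard; DNA; PLN; 100 BP.',
    'ID   LONGENTRY0001 standard; DNA; PLN; 200 BP.',
);
printf "%s (%d columns)\n", @$_ for long_id_fields(11, @sample);
```

Note that the width counted here includes the spaces after the entryname, so a result of "nothing above 11" means the dataclass never starts later than 11 columns after the start of the entryname.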
Since I have accessions >11 chars, I have to maintain my own version of
embl.pm :(
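
To make the difference between the formats concrete, here is a rough sketch. It is not the BioPerl source: id_line and id_line_untrimmed are hypothetical names, the ID is made up, and only the part of the line built by the quoted sprintf is reproduced. The last variant simply lets the field grow with the entryname, which is one possible way to avoid trimming; whether other parsers accept the shifted columns is exactly the open question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the quoted sprintf, not the code in embl.pm.
# $width is the column width of the entryname field; the name is cut to
# $width - 1 characters so that at least one space precedes the dataclass.
sub id_line {
    my ($id, $mol, $div, $len, $width) = @_;
    return sprintf("%-*.*sstandard; %s; %s; %d BP.",
                   $width, $width - 1, $id, $mol, $div, $len);
}

# Assumed variant that never truncates: the field grows with the entryname,
# but is never narrower than the usual 11 columns.
sub id_line_untrimmed {
    my ($id, $mol, $div, $len) = @_;
    my $width = length($id) + 1;
    $width = 11 if $width < 11;
    return id_line($id, $mol, $div, $len, $width);
}

my $id = 'PHYSCO000123';   # made-up 12-character ID
print id_line($id, 'DNA', 'PLN', 300, 11), "\n";        # like "%-11.10s": cut to 10 chars
print id_line($id, 'DNA', 'PLN', 300, 12), "\n";        # like "%-12.11s": cut to 11 chars
print id_line_untrimmed($id, 'DNA', 'PLN', 300), "\n";  # full ID kept
```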
I'll post the answer from EBI.
Cheers Daniel
simon andrews (BI) wrote:
|
|>-----Original Message-----
|>From: Daniel Lang [mailto:daniel.lang at biologie.uni-freiburg.de]
|>Sent: 04 November 2004 08:38
|>To: Bioperl-List
|>Subject: [Bioperl-l] Length of ID in EMBL sequence entries
|>
|>
|>Hi,
|>I just stumbled over my sequence IDs getting trimmed to 10
|>characters when writing with Bio::SeqIO::embl. line 453:
|>$temp_line = sprintf("%-11.10sstandard; $mol; $div; %d BP.",
|>$seq->id(), $len);
|
|
| It's one of the fixes in:
|
| http://bugzilla.bioperl.org/show_bug.cgi?id=1618
|
| ..the problem was that the original format, if given an 11 character ID
| code, would let it run directly into the dataclass field on the ID
| line, so files generated this way were not recognised by a
| number of analysis programs. A couple of previous posts
| on this list were caused by this formatting issue.
|
| Looking back through the EMBL manual at:
|
| http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html
|
| I don't see any clear guidance about exact positioning in the ID line.
| All their examples show the dataclass (usually "standard") starting 11
| characters from the beginning of the entryname so the fix included kept
| this distance, but limited the entryname to 10 chars to enforce a space
| between fields. The specification seems to include a space between
| entryname and dataclass, but that would limit us to 10 char entrynames.
|
| We could code this as:
|
| $temp_line = sprintf("%-12.11sstandard; $mol; $div; %d BP.",
|
| ..which would allow an 11 char entryname but shift the rest
| of the line along by one column. That still looks like it conforms to
| the specification, but might it break other things?
|
| Does anyone have a definitive answer about the correct way to do this?
|
| Simon.
--
Daniel Lang
University of Freiburg, Plant Biotechnology
Sonnenstr. 5, D-79104 Freiburg
phone: +49 761 203 6988
homepage: http://www.plant-biotech.net/
e-mail: daniel.lang at biologie.uni-freiburg.de
#################################################
My software never has bugs.
It just develops random features.
#################################################