[Bioperl-l] Re: [Bioperl-guts-l] [Bug 1573] New: Setting illegal ids
Heikki Lehvaslaiho
heikki at nildram.co.uk
Sun Dec 21 06:45:17 EST 2003
Valentin,
I think you are right in this. Whitespace in display_id is bad news and should
not be allowed. This is one of the many conventions in sequence formats.
However, I am a bit hesitant to extend this ban to other fields without hard
data and real need. Suggestions welcome, though.
I found these from database format documents:
EMBL:
Entryname: stable identifier, consisting of alphanumeric characters,
starting with a letter. All letters should be in upper case.
SWISS-PROT:
The Swiss-Prot entry name consists of up to ten uppercase alphanumeric
characters. Swiss-Prot uses a general purpose naming convention that
can be symbolized as X_Y,
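A literal reading of the EMBL rule quoted above can be sketched as a regular expression. This is only an illustration: the sub name looks_like_embl_entryname is hypothetical and not part of Bioperl, and the Swiss-Prot rule is left out because the quoted description is cut off.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Bioperl function): literal reading of the quoted
# EMBL rule -- alphanumeric, starting with a letter, all letters upper case.
sub looks_like_embl_entryname {
    my ($name) = @_;
    return 0 unless defined $name;
    return $name =~ /\A[A-Z][A-Z0-9]*\z/ ? 1 : 0;
}

# Try a few made-up names; anything with whitespace is rejected as a side effect.
for my $name ('HSA12345', 'hsa12345', '1ABC', 'MY ID') {
    printf "%-12s %s\n", "[$name]",
        looks_like_embl_entryname($name) ? 'ok' : 'rejected';
}
```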
Several formats can, in principle, tolerate whitespace in IDs. A quick look
through the formats identified these as the ones to tackle now:
fasta gcg genbank embl mase pir swiss
My only concern is that there might be some unforeseen effect if I enforce
this just before the release. I suggest that for now I add:
$self->warn("No whitespace allowed in SWISS-PROT display id [".
$seq->display_id. "]") if $seq->display_id =~ /\s/;
Setting $seq->verbose(2) before printing out will then convert this into a
throw.
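The warn-then-throw behaviour described above can be sketched without Bioperl installed. Sketch::Writer below is a hypothetical stand-in, not the Bioperl implementation; it only assumes that a verbosity of 2 turns a warning into an exception, as stated above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a SeqIO writer; NOT the Bioperl code.
package Sketch::Writer;

sub new {
    my ($class, %args) = @_;
    my $self = { verbose => defined $args{verbose} ? $args{verbose} : 0 };
    return bless $self, $class;
}

# Assumption: at verbosity 2 a warning becomes fatal, otherwise it is printed.
sub warn {
    my ($self, $msg) = @_;
    die "EXCEPTION: $msg\n" if $self->{verbose} >= 2;
    CORE::warn "WARNING: $msg\n";
    return;
}

# Returns 1 if the id has no whitespace; warns (or throws) and returns 0 otherwise.
sub check_display_id {
    my ($self, $id) = @_;
    if ($id =~ /\s/) {
        $self->warn("No whitespace allowed in SWISS-PROT display id [$id]");
        return 0;
    }
    return 1;
}

package main;

my $lenient = Sketch::Writer->new(verbose => 0);
my $strict  = Sketch::Writer->new(verbose => 2);

my $ok    = $lenient->check_display_id('MY PROTEIN');            # only warns
my $threw = eval { $strict->check_display_id('MY PROTEIN'); 1 } ? 0 : 1;
print "lenient returned $ok, strict threw: $threw\n";
```

The point of the design is that default users only see a message, while anyone who wants strict validation can opt in by raising the verbosity.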
A cleaner and simpler alternative would be to add a warning to the
value-setting code of Bio::PrimarySeq::display_id():
$self->warn("It is a REALLY bad idea to have whitespace in display_id [".
$seq->display_id. "]") if $seq->display_id =~ /\s/;
But is that too intrusive?
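To make the setter-level alternative concrete, here is a minimal sketch. Sketch::Seq is a hypothetical class, not Bio::PrimarySeq, and the choice to store the value anyway after warning is an assumption made so that formats tolerating whitespace keep working.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence object; NOT Bio::PrimarySeq itself.
package Sketch::Seq;

sub new {
    my ($class) = @_;
    my $self = { display_id => undef };
    return bless $self, $class;
}

# Combined getter/setter: warns at set time but still stores the value.
sub display_id {
    my ($self, $value) = @_;
    if (defined $value) {
        warn "It is a REALLY bad idea to have whitespace in display_id [$value]\n"
            if $value =~ /\s/;
        $self->{display_id} = $value;
    }
    return $self->{display_id};
}

package main;

# Capture warnings in an array so the demo can count them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $seq = Sketch::Seq->new;
$seq->display_id('P12345');        # clean id, no warning
$seq->display_id('my protein');    # whitespace, one warning

print "stored: [", $seq->display_id, "], warnings: ", scalar(@warnings), "\n";
```

Warning in the setter catches the problem where it is introduced, for every output format at once, which is also why it may feel intrusive to users who only ever write whitespace-tolerant formats.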
-Heikki
P.S. open-bio.org/bioperl.org domain has been down since yesterday.
On Saturday 20 Dec 2003 12:10 pm, bugzilla-daemon at portal.open-bio.org wrote:
> http://bugzilla.bioperl.org/show_bug.cgi?id=1573
>
> Summary: Setting illegal ids
> Product: Bioperl
> Version: main-trunk
> Platform: PC
> OS/Version: Windows 2000
> Status: NEW
> Severity: enhancement
> Priority: P2
> Component: Bio::SeqIO
> AssignedTo: bioperl-guts-l at bioperl.org
> ReportedBy: valentin_ruano at yahoo.es
>
>
> In the swiss format, and perhaps in some others as well, a sequence id must
> not contain blanks; an exception is thrown when reading a sequence whose id
> contains blanks from the input stream.
>
> It is possible to set the id of a SeqI instance with blanks in it, so far
> so good since we may write this sequence in a format that stands it.
>
> The problem is that SeqIO outputting in swiss format does not complain when
> such a sequence is written into the output stream.
>
> Subsequent reading on the resulting file will throw an exception.
>
> Would it not be better to provide a stricter validation step when writing
> into a swiss-formatted file? Throw an exception?
> Personally, I do not believe in converting illegal characters into legal
> ones on the fly (e.g. blank -> '_') as adopted in other modules, since this
> will silence possible programming mistakes and does not allow customisation
> (e.g. I may rather want '#' for blanks).
>
> I guess the same story may well apply to other fields.
>
> ====================
>
> The exception thrown when trying to read a seq file with blank-containing
> ids follows:
>
> ------------- EXCEPTION -------------
> MSG: swissprot stream with no ID. Not swissprot in my book
> STACK Bio::SeqIO::swiss::next_seq
> /usr/lib/perl5/site_perl/5.8.2/Bio/SeqIO/swiss.pm:180
> STACK Bio::SeqIO::READLINE /usr/lib/perl5/site_perl/5.8.2/Bio/SeqIO.pm:640
> STACK toplevel /cygdrive/c/Program
> Files/eclipse/workspace/meb-toolbox/perl/seqr en.pl:282
>
> --------------------------------------
>
>
>
--
______ _/ _/_____________________________________________________
_/ _/ http://www.ebi.ac.uk/mutations/
_/ _/ _/ Heikki Lehvaslaiho heikki_at_ebi ac uk
_/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
_/ _/ _/ Wellcome Trust Genome Campus, Hinxton
_/ _/ _/ Cambs. CB10 1SD, United Kingdom
_/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________