[Bioperl-l] Re: [Bioperl-guts-l] [Bug 1573] New: Setting illegal ids

Heikki Lehvaslaiho heikki at nildram.co.uk
Sun Dec 21 06:45:17 EST 2003


Valentin,

I think you are right in this. Whitespace in display_id is bad news and should 
not be allowed. This is one of the many conventions in sequence formats. 
However, I am a bit hesitant to extend this ban to other fields without hard 
data and real need. Suggestions welcome, though.

I found these from database format documents:

  EMBL:
  Entryname: stable identifier, consisting of alphanumeric character,
  starting with a letter. All letters should be in upper case.

  SWISS-PROT:
  The Swiss-Prot entry name consists of up to ten uppercase alphanumeric
  characters. Swiss-Prot uses a general purpose naming convention that
  can be symbolized as X_Y,
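
To make the two quoted rules concrete, here is a rough sketch in plain Perl. 
The helper names are made up for illustration and are not part of BioPerl. 
The patterns follow only the wording quoted above, and I am assuming the 
underscore in X_Y does not count towards the ten characters; the real format 
documents may impose further limits.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of BioPerl. They encode only the rules
# quoted above; the official formats may be stricter.

# EMBL entry name: alphanumeric, starts with a letter, upper case only.
sub looks_like_embl_entryname {
    my ($id) = @_;
    return $id =~ /\A[A-Z][A-Z0-9]*\z/ ? 1 : 0;
}

# Swiss-Prot entry name of the form X_Y: uppercase alphanumerics on both
# sides of a single underscore, at most ten alphanumeric characters in
# total (assumption: the underscore itself is not counted).
sub looks_like_swissprot_entryname {
    my ($id) = @_;
    return 0 unless $id =~ /\A([A-Z0-9]+)_([A-Z0-9]+)\z/;
    return length($1) + length($2) <= 10 ? 1 : 0;
}

for my $id ('HSA123', 'P53_HUMAN', 'my seq') {
    printf "%-10s embl=%d swiss=%d\n", "[$id]",
        looks_like_embl_entryname($id),
        looks_like_swissprot_entryname($id);
}
```

Neither pattern lets whitespace through, which is the case the bug report is 
about.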


Several formats can, in principle, tolerate whitespace in IDs. A quick look 
through the formats identified these as the ones to tackle now:

  fasta gcg genbank embl mase pir swiss
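
A per-format check over that list could look roughly like the following. 
This is only a sketch in plain Perl: the table and function names are 
hypothetical, and treating every format outside the list as tolerant of 
whitespace is my assumption, not something I have verified.

```perl
use strict;
use warnings;

# Formats from the list above whose IDs should not contain whitespace.
# The hash and the function below are hypothetical, not BioPerl API.
my %NO_WHITESPACE_IN_ID =
    map { $_ => 1 } qw(fasta gcg genbank embl mase pir swiss);

# Returns 1 if $id is acceptable for $format, 0 otherwise.
# Assumption: formats not in the table accept any id.
sub id_ok_for_format {
    my ($format, $id) = @_;
    return 1 unless $NO_WHITESPACE_IN_ID{ lc $format };
    return $id =~ /\s/ ? 0 : 1;
}

for my $format (qw(swiss fasta someotherformat)) {
    printf "%-16s %s\n", $format,
        id_ok_for_format($format, 'my seq') ? 'accepted' : 'rejected';
}
```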


My only concern is that there might be some unforeseen effect if I enforce 
this just before the release. I suggest that for now I add: 

  $self->warn("No whitespace allowed in SWISS-PROT display id [".  
      $seq->display_id. "]") if $seq->display_id =~ /\s/;

Setting $seq->verbose(2) before printing out will then convert this into a 
throw. 
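
For anyone unfamiliar with the warn/throw switch, here is a self-contained 
sketch of the behaviour I mean. It does not load BioPerl: Sketch::Writer and 
check_display_id are hypothetical stand-ins, and the only thing they mimic is 
that a warning becomes an exception once verbosity is set to 2.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a SeqIO writer. It only mimics the
# warn-or-throw switch described above and does not use BioPerl.
package Sketch::Writer;

sub new { my ($class) = @_; return bless { verbose => 0 }, $class }

# Combined getter/setter for the verbosity level.
sub verbose {
    my ($self, $level) = @_;
    $self->{verbose} = $level if defined $level;
    return $self->{verbose};
}

# Warn at normal verbosity; die (throw) once verbosity is 2 or more.
sub warn {
    my ($self, $msg) = @_;
    die "EXCEPTION: $msg\n" if $self->verbose >= 2;
    CORE::warn "WARNING: $msg\n";
    return;
}

# Check the id before writing; returns 1 if clean, 0 after a warning.
sub check_display_id {
    my ($self, $id) = @_;
    if ($id =~ /\s/) {
        $self->warn("No whitespace allowed in SWISS-PROT display id [$id]");
        return 0;
    }
    return 1;
}

package main;

my $out = Sketch::Writer->new;
$out->check_display_id('my seq');    # warns on STDERR and carries on

$out->verbose(2);                    # from now on the warning is fatal
my $ok = eval { $out->check_display_id('my seq'); 1 };
print "write aborted: $@" unless $ok;
```

The point of this design is that existing scripts keep running by default, 
while anyone who wants strict behaviour can opt in.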

A cleaner and simpler alternative would be to add a warning to the 
value-setting code of Bio::PrimarySeq::display_id(). 

 $self->warn("It is a REALLY bad idea to have whitespace in display_id [".  
      $seq->display_id. "]") if $seq->display_id =~ /\s/;

but is that too intrusive?
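
To show what the setter variant would mean in practice, here is a minimal 
sketch. Sketch::Seq is a hypothetical stand-in for Bio::PrimarySeq, written 
in plain Perl; the real accessor may be structured differently. Note that the 
value is still stored, so the warning is advisory only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Bio::PrimarySeq; not the real class.
package Sketch::Seq;

sub new { my ($class) = @_; return bless {}, $class }

# Combined getter/setter that warns, but does not refuse, when the
# new value contains whitespace.
sub display_id {
    my ($self, $value) = @_;
    if (defined $value) {
        warn "It is a REALLY bad idea to have whitespace in display_id [$value]\n"
            if $value =~ /\s/;
        $self->{display_id} = $value;    # stored regardless of the warning
    }
    return $self->{display_id};
}

package main;

my $seq = Sketch::Seq->new;
$seq->display_id('my seq');              # warns on STDERR
print $seq->display_id, "\n";            # the id is kept as given
```

Every caller that sets an id would see this warning, whatever format the 
sequence ends up in, which is why I ask whether it is too intrusive.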

	-Heikki

P.S. The open-bio.org/bioperl.org domain has been down since yesterday.


On Saturday 20 Dec 2003 12:10 pm, bugzilla-daemon at portal.open-bio.org wrote:
> http://bugzilla.bioperl.org/show_bug.cgi?id=1573
>
>            Summary: Setting illegal ids
>            Product: Bioperl
>            Version: main-trunk
>           Platform: PC
>         OS/Version: Windows 2000
>             Status: NEW
>           Severity: enhancement
>           Priority: P2
>          Component: Bio::SeqIO
>         AssignedTo: bioperl-guts-l at bioperl.org
>         ReportedBy: valentin_ruano at yahoo.es
>
>
> In the swiss format, and perhaps in some others as well, a sequence id must
> not contain blanks; an exception is thrown when reading a sequence with a
> blank-containing id from the input stream.
>
> It is possible to set the id of a SeqI instance with blanks in it, so far
> so good since we may write this sequence in a format that stands it.
>
> The problem is that SeqIO outputting in swiss format does not complain when
> such a sequence is written into the output stream.
>
> Subsequent reading on the resulting file will throw an exception.
>
> Would it not be better to provide a stricter validation step when writing
> into a swiss-formatted file? Throw an exception?
> Personally, I do not believe in converting illegal characters into legal
> ones on the fly (e.g. blank -> '_') as adopted in other modules, since this
> will silence possible programming mistakes and does not allow customisation
> (e.g. I may rather want '#' for blanks).
>
> I guess the same story may well apply to other fields.
>
> ====================
>
> The exception thrown when trying to read a seq file with blank-containing
> ids follows:
>
> ------------- EXCEPTION  -------------
> MSG: swissprot stream with no ID. Not swissprot in my book
> STACK Bio::SeqIO::swiss::next_seq
> /usr/lib/perl5/site_perl/5.8.2/Bio/SeqIO/swiss.pm:180
> STACK Bio::SeqIO::READLINE /usr/lib/perl5/site_perl/5.8.2/Bio/SeqIO.pm:640
> STACK toplevel /cygdrive/c/Program
> Files/eclipse/workspace/meb-toolbox/perl/seqr en.pl:282
>
> --------------------------------------

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho    heikki_at_ebi ac uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________

