[Bioperl-l] RE: SeqIO fails on masked sequences

Nathan Haigh nathanhaigh at ukonline.co.uk
Sun Jan 9 19:35:08 EST 2005


> -----Original Message-----
> From: Wes Barris [mailto:wes.barris at csiro.au]
> Sent: 09 January 2005 23:43
> To: Hilmar Lapp
> Cc: nathanhaigh at ukonline.co.uk; 'Bioperl list'; 'Brian Osborne'
> Subject: Re: [Bioperl-l] RE: SeqIO fails on masked sequences
> 
> Hilmar Lapp wrote:
> > You should not require by default that all sequences in one file be of
> > the same type (alphabet). We never have required this, nor documented
> > that it is a (not enforced) requirement, and so there may be people out
> > there relying on this 'feature'.
> 
> Mixing both DNA and protein sequences in one file and then attempting
> to process it seems like kind of a bizarre thing to want to do.  If
> the alphabet is explicitly specified, isn't there a way to make that
> take precedence?

Why are you then able to set the alphabet of a SeqIO object if, whenever you call next_seq(), it tries to guess the alphabet of the
sequence anyway? It seems more logical to me that the user can specify the alphabet without worrying about bioperl guessing it and
getting it wrong, or not setting it at all.
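
A minimal sketch of that precedence rule in plain Perl, for illustration only. resolve_alphabet() and the toy guesser are hypothetical names, not part of BioPerl, and the guesser is deliberately simplistic:

```perl
use strict;
use warnings;

# Hypothetical helper, not BioPerl's code: choose the alphabet for one record.
# $explicit is whatever the user set on the stream (may be undef);
# $guess is a code ref that is called only when nothing was set.
sub resolve_alphabet {
    my ($explicit, $seq, $guess) = @_;
    return $explicit if defined $explicit && length $explicit;
    return $guess->($seq);
}

# Stand-in guesser for the demonstration only: any residue outside the
# IUPAC nucleotide codes is taken to mean protein.
my $toy_guess = sub {
    my ($seq) = @_;
    return $seq =~ /[^ACGTUNRYSWKMBDHV]/i ? 'protein' : 'dna';
};

# The explicit setting wins even though the residues look like DNA.
print resolve_alphabet('protein', 'ACGTNNNN', $toy_guess), "\n";

# With nothing set, the guesser decides.
print resolve_alphabet(undef, 'ACGTNNNN', $toy_guess), "\n";
```

The first call prints "protein" and the second prints "dna". The point is only that the guess is never consulted once the user has stated the alphabet.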

> 
> >
> >     -hilmar
> >
> > On Friday, January 7, 2005, at 03:39  AM, Nathan Haigh wrote:
> >
> >> There appears to be an anomaly with Bio::SeqIO::fasta. If the SeqIO
> >> object's alphabet is set, next_seq() resets it to undef
> >> and then proceeds to guess the alphabet again, so things like the
> >> following do not work:
> >>
> >> my $seq_in  = Bio::SeqIO->new(-format=>$format, -fh => \*DATA);
> >>
> >> $seq_in->alphabet('protein');
> >>
> >> Should setting the SeqIO object's alphabet be honoured even if it is
> >> set to the wrong type or the sequences are not of that
> >> alphabet?
> >>
> >>
> >>
> >> I have a bug fix, that allows you to set the alphabet through the
> >> SeqIO object, but it doesn't do any sort of checking to see if all
> >> the seqs in the object are of the correct type. Essentially, the
> >> alphabet is set in one of the following ways:
> >>
> >> 1) If the SeqIO object's alphabet is set using e.g. $seq_in->alphabet('dna'), all
> >> the seqs read from the $seq_in object obtain their
> >> alphabet from the SeqIO object, dna in this case, irrespective of
> >> whether or not they are actually protein.
> >>
> >> 2) If alphabet has not been set in this way, the first sequence is
> >> used to guess the alphabet of the SeqIO object, from which all
> >> the sequences obtain their alphabet.
> >>
> >>
> >>
> >> Possible limitations:
> >>
> >> 1) All seqs in the SeqIO object must be of the same type; no
> >> testing is done to see whether this is the case.
> >>
> >>
> >>
> >> Does this sound ok and reasonable?
> >>
> >> Nathan
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Brian Osborne [mailto:brian_osborne at cognia.com]
> >> Sent: 06 January 2005 12:25
> >> To: nathanhaigh at ukonline.co.uk
> >> Subject: RE: SeqIO fails on masked sequences
> >>
> >>
> >>
> >> Nathan,
> >>
> >>
> >>
> >> The idea is that a sequence with a high proportion of X is more likely
> >> to be DNA than protein. The examples I had in mind are
> >> unfinished genomic sequence, and there are countless entries in
> >> Genbank/EMBL like this. So, someone wrote in and said that their
> >> genomic sequence was being characterized as protein since the fraction
> >> [gatc] was less than 85%, it was mostly X. By contrast, there
> >> are no protein sequences with X in them in these public databases, if
> >> I'm not mistaken. So I maintain that in the world of public
> >> databases this is the way to go.
> >>
> >>
> >>
> >> Now if you venture into the world of sequence analysis it's going to
> >> be a different story, since you'll likely mask protein with X,
> >> not N, obviously. May I ask, if this person knows his/her sequence is
> >> protein then why doesn't s/he set its alphabet to "protein"?
> >> Or why don't they mask with A or Z or O or something?
> >>
> >>
> >>
> >> There'll be problems either way. What is one's reference? Public
> >> sequence or the less well-defined set of possible sequences?
> >>
> >>
> >>
> >> Brian O.
> >>
> >> -----Original Message-----
> >> From: Nathan Haigh [mailto:nathanhaigh at ukonline.co.uk]
> >> Sent: Wednesday, January 05, 2005 7:38 PM
> >> To: 'Brian Osborne'
> >> Subject: FW: SeqIO fails on masked sequences
> >>
> >> You committed a change to Bio::PrimarySeq where 'X' was added to the
> >> class of characters that are stripped out of sequences in the
> >> _guess_alphabet subroutine. Do you know why sequences containing X
> >> were causing a problem, and why X was added to the class of
> >> chars?
> >>
> >>
> >>
> >> It's causing a problem for someone who has a sequence that consists
> >> entirely of masked chars (i.e. all X's), which should still be
> >> "guessable" as protein.
> >>
> >>
> >>
> >> Cheers
> >>
> >> Nathan
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at portal.open-bio.org
> >> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>
> 
> 
> --
> Wes Barris
> E-Mail: Wes.Barris at csiro.au
> 
> 

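For reference, the guessing heuristic discussed in the quoted thread can be sketched in plain Perl as follows. This is a hypothetical illustration, not the actual Bio::PrimarySeq code: the function name, the exact set of stripped characters, and the set of letters counted as nucleotide are assumptions. Only the 85% cut-off and the stripping of X come from the messages above.

```perl
use strict;
use warnings;

# Hypothetical sketch of the heuristic described in the thread; not the
# actual Bio::PrimarySeq code. Assumptions: gap characters and X are stripped
# before counting, and more than 85% nucleotide letters means DNA.
sub guess_alphabet {
    my ($seq) = @_;
    (my $residues = $seq) =~ s/[-.?xX]//g;    # drop gaps and masking X
    my $total = length $residues;
    return unless $total;                     # nothing left to judge, e.g. all X

    # tr/// with an empty replacement counts matches without modifying.
    my $nucleotide = ($residues =~ tr/ACGTNacgtn//);
    return $nucleotide / $total > 0.85 ? 'dna' : 'protein';
}

for my $seq ('ACGTXXXXXXXXACGT', 'MKVLAAGIXXXXLLW', 'XXXXXXXXXX') {
    my $alphabet = guess_alphabet($seq);
    printf "%-18s %s\n", $seq, defined $alphabet ? $alphabet : 'undetermined';
}
```

In this sketch the X-masked genomic fragment is guessed as dna, the masked peptide as protein, and the all-X sequence comes back undetermined because nothing is left after stripping. That last case is the one where an explicitly set alphabet is the only reliable answer.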