[Biopython] how to validate fasta format

Tue Oct 27 13:14:20 UTC 2009

Greetings,

This thread touches on a topic we've revisited lately: ambiguity codes,
particularly in the amino acid alphabet.  I would like to ask the group's
opinion on the 6 letters that remain once the 20 standard amino acids
are removed.  Here's our list:

B - Asn or Asp
J - Ile or Leu
O - ???
U - seleno-Cys
X - Any
Z - Gln or Glu
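
For what it's worth, here is a minimal sketch of how these letters could
feed the set-based check Peter describes in the quoted message below.  It
uses plain Python strings only; the names STANDARD_AA, EXTRA_AA and
unexpected_letters are made up for illustration and are not part of
Biopython.

```python
# Sketch only: plain-Python check, no Biopython required.
# All names here are hypothetical, chosen for illustration.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
EXTRA_AA = set("BJOUXZ")                   # the 6 remaining letters listed above


def unexpected_letters(sequence, allowed):
    """Return the sorted letters in `sequence` that are not in `allowed`."""
    return sorted(set(str(sequence).upper()) - set(allowed))


# B is rejected by the strict alphabet but accepted by the extended one.
print(unexpected_letters("MKVLAB", STANDARD_AA))             # ['B']
print(unexpected_letters("MKVLAB", STANDARD_AA | EXTRA_AA))  # []
```

With SeqIO, passing str(record.seq) as the first argument should work the
same way.  Note that the extended set covers all 26 letters, so it would
only catch non-letter characters such as "*" or "-".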

~Steve

On Tue, 2009-10-27 at 10:08 +0000, Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
> > Hello All,
> >
> > Is it possible to validate a sequence format, for example while the sequence
> > is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> > illegal characters in .seq?
> >
> > Cheers,
> > yvan
> 
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
> 
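> from Bio import SeqIO
>
> wanted = set("ACGT")  # whitelist of allowed letters
> # handle is assumed to be an already-open FASTA file
> for record in SeqIO.parse(handle, "fasta"):
>     if not wanted.issuperset(record.seq):
>         print("Bad: %s" % record.id)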
> 
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
> 
> Peter