[Biopython] how to validate fasta format
Steve Darnell
darnells at dnastar.com
Tue Oct 27 13:14:20 UTC 2009
Greetings,
This thread touches on a topic we've revisited lately: ambiguity codes
(particularly in the amino acid alphabet). I would like to ask the group
for their opinion on the 6 letters that remain after you remove the 20
standard amino acids. Here's our list:
B - Asn or Asp
J - Ile or Leu
O - ???
U - seleno-Cys
X - Any
Z - Gln or Glu
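
As a rough sketch of how the set-based check quoted below might be applied to proteins, the following separates the 6 letters listed above from characters that are not letters of the alphabet at all. The names STANDARD, EXTRA and classify_letters are hypothetical and are not part of Biopython; the sequence is treated as a plain string.

```python
# Hypothetical names for illustration only; not a Biopython API.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
EXTRA = set("BJOUXZ")                   # the 6 remaining letters listed above


def classify_letters(sequence):
    """Return (extra, illegal) sets of characters found in a protein sequence."""
    letters = set(str(sequence).upper())
    extra = letters & EXTRA                # ambiguity or rare codes present
    illegal = letters - STANDARD - EXTRA   # anything outside all 26 letters
    return extra, illegal


extra, illegal = classify_letters("MKTBXLLU*")
print("Extra letters: %s" % sorted(extra))
print("Illegal characters: %s" % sorted(illegal))
```

With a parsed record, the same function could be called as classify_letters(record.seq), since it converts its argument to a string first.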
~Steve
On Tue, 2009-10-27 at 10:08 +0000, Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
> > Hello All,
> >
> > Is it possible to validate a sequence format, for example while the sequence
> > is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> > illegal characters in .seq?
> >
> > Cheers,
> > yvan
>
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
>
> from Bio import SeqIO
>
> wanted = set("ACGT")
> for record in SeqIO.parse(handle, "fasta"):
>     # Flag any record containing letters outside the whitelist
>     if not wanted.issuperset(record.seq):
>         print("Bad: %s" % record.id)
>
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython