[Biopython] how to validate fasta format
Yvan Strahm
yvan.strahm at bccs.uib.no
Tue Oct 27 12:03:11 UTC 2009
Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
>> Hello All,
>>
>> Is it possible to validate a sequence format, for example while the sequence
>> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
>> illegal characters in .seq?
>>
>> Cheers,
>> yvan
>
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
>
> from Bio import SeqIO
>
> wanted = set("ACGT")
> for record in SeqIO.parse(handle, "fasta"):
>     # Flag any record containing letters outside the whitelist
>     if not wanted.issuperset(record.seq):
>         print("Bad: %s" % record.id)
>
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
>
> Peter
Thanks for the quick reply.
Yes, by validating I mainly meant checking for the correct alphabet in the Seq object, but also
the correct header format. So I guess I have to trust the user... ;-)
thanks again
yvan
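
For reference, the set-based whitelist check suggested above could be wrapped in a small helper that also reports which letters were unexpected. This is only a sketch: the function name invalid_records and the Record namedtuple are hypothetical stand-ins (not part of Biopython), and the code assumes only that each record has .id and .seq attributes, as SeqRecord objects do.

```python
from collections import namedtuple

# Hypothetical stand-in for a Biopython SeqRecord; only .id and .seq are used.
Record = namedtuple("Record", ["id", "seq"])


def invalid_records(records, allowed="ACGT"):
    """Yield (id, bad_letters) for each record with letters outside `allowed`."""
    allowed_set = set(allowed.upper())
    for record in records:
        # str() so that both plain strings and Seq-like objects are accepted
        letters = set(str(record.seq).upper())
        bad = letters - allowed_set
        if bad:
            yield record.id, "".join(sorted(bad))


# Example data (made up for illustration)
records = [Record("seq1", "ACGTACGT"), Record("seq2", "ACGTNNRY")]
for rec_id, bad in invalid_records(records):
    print("Bad: %s (unexpected letters: %s)" % (rec_id, bad))
```

With Biopython installed, the same helper should accept the iterator returned by SeqIO.parse(handle, "fasta") in place of the example list. Note that the comparison here is case-insensitive, which is a design choice of this sketch rather than something stated in the thread.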