[Bioperl-l] Validate Fasta

James Wasmuth james.wasmuth at ed.ac.uk
Wed Mar 3 06:23:01 EST 2004


From my understanding, the validator was built for speed rather than 
complete accuracy.

It WILL fail if there is no header, but when checking the actual 
sequence it uses (or used) a number of general rules.

They were something along the lines of:

if 80% of the characters are G, T, C or A, then it is DNA
something with U's in it is RNA
and anything else is protein.

Forgive me if I got that wrong; it was a while back that I looked at 
this.
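
As a rough illustration of why a guess like this cannot act as a validator,
here is a minimal Perl sketch of the rules as recalled above. The sub name
guess_alphabet_sketch, the exact 0.8 cut-off and the order of the tests are
assumptions made for the example; this is not BioPerl's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: label a sequence string using the rough rules
# recalled in this message. Threshold and test order are assumptions.
sub guess_alphabet_sketch {
    my ($seq) = @_;
    $seq =~ s/\s+//g;                      # ignore line breaks and spaces
    return 'unknown' unless length $seq;

    my $acgt = ($seq =~ tr/ACGTacgt//);    # count A, C, G and T
    return 'dna' if $acgt / length($seq) >= 0.8;
    return 'rna' if $seq =~ /u/i;          # any U at all
    return 'protein';                      # everything else
}

# Print the label assigned to a few example strings.
for my $example ('GATTACAGATTACA', 'AUGGCUUUU', 'MKVLHHPWQE', '#!/$%;{}') {
    printf "%-16s %s\n", $example, guess_alphabet_sketch($example);
}
```

Every non-empty input gets some label, so junk is never rejected; in this
sketch it simply falls through to 'protein'.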

If you want to check the sequence strictly, then write your own check.
For DNA:

unless ($seq =~ /^[GCTA\n]+$/) {
    print "Error!\n";
}
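
The same idea can be applied line by line to a whole file. What follows is
only a sketch under stated assumptions: fasta_dna_errors is a hypothetical
helper, not part of BioPerl; it accepts only upper-case G, C, T and A (as in
the regex above), treats blank lines and any whitespace inside a sequence
line as errors, and does not inspect header text beyond the leading '>'.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return a list
# of error strings. An empty list means every line passed the strict check.
sub fasta_dna_errors {
    my ($fh) = @_;
    my (@errors, $in_record);
    my $n = 0;                                 # current line number

    while (my $line = <$fh>) {
        $n++;
        $line =~ s/\r?\n\z//;                  # strip the line ending only
        if ($line =~ /^>/) {                   # header line starts a record
            $in_record = 1;
            next;
        }
        if (!$in_record) {
            push @errors, "line $n: no '>' header before sequence data";
            last;                              # nothing sensible to check
        }
        push @errors, "line $n: unexpected characters"
            unless $line =~ /^[GCTA]+\z/;      # also rejects blank lines
    }
    push @errors, "no records found" unless $in_record || @errors;
    return @errors;
}

# Usage with in-memory files; a real script would open the FASTA file.
my $good = ">seq1\nGATTACA\nGGCC\n";
my $bad  = ">seq1\nGATT ACA\nmy\$date=`date`;\n";
for my $text ($good, $bad) {
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @errors = fasta_dna_errors($fh);
    print @errors ? "Error!\n" : "OK\n";
    print "  $_\n" for @errors;
}
```

To allow lower case or ambiguity codes such as N, widen the character class
in the final regex.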

james


>
>
> john herbert wrote:
>
>>Interestingly, it also does not complain if you convert the fasta Perl
>>to EMBL format either :-)
>>
>>ID   #!/usr/bin/perlstandard; AA; UNK; 527 BP.
>>XX
>>AC   unknown;
>>XX
>>DE   
>>XX
>>FH   Key             Location/Qualifiers
>>FH
>>XX
>>SQ   Sequence 527 BP; 38 A; 19 C; 5 G; 29 T; 436 other;
>>     my$backups ={'mysql'= "/mick/mys ql/",'apac he'="/res/ upity/apac 
>>      60
>>     he",'mwats on'="/res/ upity/mwat son",'www' ="/www/doc s",'ensemb 
>>     120
>>     l'="/too/f ools/ensem bl",'cgi'= "/www/cgi- bin/"};my$ location=" 
>>     180
>>     /mick/back ups";my$da te=`date`; my@date=sp lit(/\s+/, $date);my$ 
>>     240
>>     date=join( "_",@date[ 0..2],$dat e[$#date]) ;print"$da te\n";#whi 
>>     300
>>     le(my($nam e,$dir)=ea ch%{$backu ps}){forea ch$name(qw (apachemys 
>>     360
>>     qlmwatsonw wwensemblc gi)){$dir= $backups-{ $name};pri nt"tarzipp 
>>     420
>>     ing$dir\n" ;system("/ bin/tar-c$ dir$locati on/$name.$ date.tar") 
>>     480
>>     ;system("/ bin/gzip$l ocation/$n ame.$date. tar");}               
>>     527
>>//
>>
>>
>>
>>  
>>
>>>>> "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk> 03/03/2004 10:52:58 >>>
>>Thanks for your help, but I am afraid not....
>>
>>-----Original Message-----
>>From: john herbert
>>[mailto:john.herbert at clinical-pharmacology.oxford.ac.uk] 
>>Sent: 03 March 2004 10:45
>>To: michael.watson at bbsrc.ac.uk; bioperl-l at portal.open-bio.org 
>>Subject: Re: [Bioperl-l] Validate Fasta
>>
>>
>>Hello Michael.
>>I'm not a BioPerl programmer extraordinaire (so anyone correct me if I
>>am wrong) but I think the -format flag should help here. 
>>
>>Try 
>>
>>my $in = Bio::SeqIO->new(-file => "rubbish.fasta", -format =>
>>'Fasta');
>>my $out = Bio::SeqIO->new(-file => ">rubbish2.fasta", -format =>
>>'Fasta');
>>
>>I am pretty sure if you put this change in your code and run it on your
>>very nice Perl fasta sequence, it will complain. 
>>
>>Kind regards,
>>
>>John.
>>
>>
>>  
>>
>>>>> "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk> 03/03/2004 10:16:04 >>>
>>Hi
>>
>>I have searched the archives and only come up with one answer, and it
>>didn't work - I want to validate a FASTA sequence (DNA).  What I mean is
>>that if I am given a perfect FASTA sequence, then that's OK, but if there
>>are ANY whitespace characters, or any other characters that really
>>shouldn't be there, I want it to throw an error.  The script below was
>>suggested by Jason in 2002:
>>
>>use Bio::SeqIO;
>>
>>my $in = Bio::SeqIO->new(-file => "rubbish.fasta");
>>my $out = Bio::SeqIO->new(-file => ">rubbish2.fasta");
>>
>>eval {
>>	LOOP: while( my $seq = $in->next_seq ) {
>>		$out->write_seq($seq);
>>	}
>>
>>};
>>if( $@) {
>>	print "There's an Error!\n";
>>	goto LOOP;
>>}
>>
>>I actually fired this at one of my scripts, a Perl script that clearly
>>wasn't a FASTA sequence - it has #'s, \ts, \ns and all sorts of non-DNA
>>sequence characters.  Here is the result:
>>
>>
>>>#!/usr/bin/perl
>>my$backups={'mysql'="/mick/mysql/",'apache'="/res/upity/apac
>>he",'mwatson'="/res/upity/mwatson",'www'="/www/Docs",'ensemb
>>l'="/too/fools/ensembl",'cgi'="/www/cgi-bin/"};my$location="
>>/mick/backups";my$date=`date`;my@date=split(/\s+/,$date);my$
>>date=join("_",@date[0..2],$date[$#date]);print"$date\n";#whi
>>le(my($name,$dir)=each%{$backups}){foreach$name(qw(apachemys
>>qlmwatsonwwwensemblcgi)){$dir=$backups-{$name};print"tarzipp
>>ing$dir\n";system("/bin/tar-c$dir$location/$name.$date.tar")
>>;system("/bin/gzip$location/$name.$date.tar");}
>>
>>This is undoubtedly a wonderfully FASTA formatted perl script, but...
>>
>>Anyone?  Any ideas?
>>
>>Thanks in advance for the help!
>>
>>Mick
>>_______________________________________________
>>Bioperl-l mailing list
>>Bioperl-l at portal.open-bio.org 
>>http://portal.open-bio.org/mailman/listinfo/bioperl-l 
>>  
>>
>
>-- 
>Nematode Bioinformatics           ||
>Blaxter Nematode Genomics Group   ||
>School of Biological Sciences     ||
>Ashworth Laboratories             ||	
>King's Buildings                  ||    tel: +44 131 650 7403
>University of Edinburgh           ||    web: www.nematodes.org
>Edinburgh                         ||
>EH9 3JT                           ||
>UK                                ||	
>
>"I have not failed. I've just found 10,000 ways that don't work."
>               --- Thomas Edison
>  
>



