[Bioperl-l] fasta file parser

Chris Fields cjfields at uiuc.edu
Tue Jul 22 13:00:06 UTC 2008


Use exists() directly on the %seen lookup hash for your test.  Also,  
use chomp up front when creating %seen to get rid of newlines.

# assuming you have one ID per line...
while (my $line = <LIST>) {
    chomp $line;
    $seen{$line}++;
    # do whatever else here...
}

close LIST;

my $newSeqFileName = Bio::SeqIO->new(-file=> ">>INFILE", - 
format=>'fasta');
while (my $query = $SeqFileName->next_seq()) {
     my $id = $query->id;
     if ( exists( $seen{$id} ) ) {
         print "$id matched with $seen{$id} listed in $ARGV[1]:  
skipped!\n";
         next;
     }
     elsif ( $elem ne $query->id ) {
         if ( exists( $seen2{$id}++ ) ) {;
             print "$id matched with $seen2{$id}, already found:  
skipped!\n";
             next;
         }
         $newSeqFileName->write_seq($query);
     }
}

chris

On Jul 22, 2008, at 6:28 AM, ste.ghi at libero.it wrote:

> Dear all,
> I'm trying to write a script wich, given a file containing a list of
> IDs, parses a big fasta file returning only sequences NOT listed in  
> the list-
> file.
>
> To do so, I first create an array with the IDs to be excluded:
>
> [...]
>
> #Load LIST content in an array; avoids duplicates
> while (my $line = <LIST>) {
>
>
>    push(@array1,$line );
>
>    foreach my $uniq ( @array1 ){
>
> 	next if $seen
> { $uniq }++;
>
> 	push @unique, $uniq;
>
>    }
> }
>
> then, process the fasta file in
> this way (NOT WORKING).
>
> #Fasta file processing
> my $newSeqFileName  = Bio::
> SeqIO->new(-file=> ">>INFILE", -format=>'fasta');
> while (my $query =
> $SeqFileName->next_seq()) {
>       foreach my $elem(@unique){
> 		chomp $elem;
>
>
>       	if ($elem eq $query->id) {
>
>            		print $query->id." matched
> with $elem listed in $ARGV[1]: skipped!\n";
>                        next;
> 		}
> elsif ($elem ne $query->id) {
>       			next if $seen2{ $query->id }++;
> 			
> $newSeqFileName->write_seq($query);
>
>            	}
>
>        }
> }
>
>
> ...
> in this way I get only an exact copy of the input file....where am I  
> wrong?
>
> Thanks a lot for your kind help!
> Stefano
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign







More information about the Bioperl-l mailing list