[Bioperl-l] R: Re: R: Re: fasta file parser

Tue Jul 22 15:51:03 UTC 2008

chris, your suggestion sound reasonable, but I still prefer a file parser 
solution.

and I'd like to better understand your demo code, 'couse actually 
it's not working...can you just put a comment to the instructions you wrote?

thanks.
stefano

>----Messaggio originale----
>Da: cjfields at uiuc.edu
>Data: 
22/07/2008 17.01
>A: "ste.ghi at libero.it"<ste.ghi at libero.it>
>Cc: <bioperl-
l at lists.open-bio.org>
>Ogg: Re: R: Re: [Bioperl-l] fasta file parser
>
>
>On 
Jul 22, 2008, at 9:17 AM, ste.ghi at libero.it wrote:
>
>> guys, thanks for your 
ready replies
>> but now I'm a little confused...
>>
>> the first 'while' was 
just to avoid duplicates in the list...
>> does your code create a cleared 
list? what should I write where you  
>> say "# do whatever else here..."
>>
>> 
thanks
>
>Up to you; you can do absolutely nothing if you prefer.  The script 
is  
>simply a demo of what could be done based upon what we assume you want  

>using the code you provided.
>
>Sendu's approach is also quite applicable 
(remember, this is perl, so  
>TIMTOWTDI).  For instance, the way I would go 
about this myself: start  
>by creating a FASTA database (flatfile or 
otherwise) of the sequences  
>using the ID as the primary key (always a good 
practice IMHO).  After  
>the database is set up, create a lookup table by 
mapping the list of  
>IDs from the database to a hash, similarly to what Sendu 
demonstrated;  
>I believe all DB implementations in BioPerl allow you to 
retrieve all  
>seq IDs in the database as an array.
>
>Using that you would 
then delete any matching IDs in the file from the  
>lookup hash; the leftovers 
(i.e. those NOT found in the file) would be  
>used to retrieve the sequences 
from the database.  Avoids iterating  
>over every sequence (unless you count 
database creation, of course,  
>but again that could be used for other 
purposes as well).
>
>chris
>
>>
>> ----Messaggio originale----
>> Da: 
cjfields at uiuc.edu
>> Data: 22/07/2008 15.00
>> A: "ste.ghi at libero.it"<ste.
ghi at libero.it>
>> Cc: <bioperl-l at lists.open-bio.org>
>> Ogg: Re: [Bioperl-l] 
fasta file parser
>>
>> Use exists() directly on the %seen lookup hash for your 
test.  Also,  
>> use chomp up front when creating %seen to get rid of 
newlines.
>>
>> # assuming you have one ID per line...
>> while (my $line = 
<LIST>) {
>>    chomp $line;
>>    $seen{$line}++;
>>    # do whatever else 
here...
>> }
>>
>> close LIST;
>>
>> my $newSeqFileName = Bio::SeqIO->new(-
file=> ">>INFILE", - 
>> format=>'fasta');
>> while (my $query = $SeqFileName-
>next_seq()) {
>>     my $id = $query->id;
>>     if ( exists( $seen{$id} ) ) {

>>         print "$id matched with $seen{$id} listed in $ARGV[1]:  
>> skipped!
\n";
>>         next;
>>   }
>>     elsif ( $elem ne $query->id ) {
>>         
if ( exists( $seen2{$id}++ ) ) {;
>>             print "$id matched with $seen2
{$id}, already found:  
>> skipped!\n";
>>             next;
>>         }

>>         $newSeqFileName->write_seq($query);
>>     }
>> }
>>
>> chris
>>
>> 
On Jul 22, 2008, at 6:28 AM, ste.ghi at libero.it wrote:
>>
>>> Dear all,
>>> I'm 
trying to write a script wich, given a file containing a list of
>>> IDs, 
parses a big fasta file returning only sequences NOT listed in  
>>> the list-

>>> file.
>>>
>>> To do so, I first create an array with the IDs to be 
excluded:
>>>
>>> [...]
>>>
>>> #Load LIST content in an array; avoids 
duplicates
>>> while (my $line = <LIST>) {
>>>
>>>
>>>    push(@array1,$line );

>>>
>>>    foreach my $uniq ( @array1 ){
>>>
>>> 	next if $seen
>>> { $uniq }
++;
>>>
>>> 	push @unique, $uniq;
>>>
>>>    }
>>> }
>>>
>>> then, process the 
fasta file in
>>> this way (NOT WORKING).
>>>
>>> #Fasta file processing
>>> my 
$newSeqFileName  = Bio::
>>> SeqIO->new(-file=> ">>INFILE", -format=>'fasta');

>>> while (my $query =
>>> $SeqFileName->next_seq()) {
>>>       foreach my 
$elem(@unique){
>>> 		chomp $elem;
>>>
>>>
>>>       	if ($elem eq $query->id) 
{
>>>
>>>            		print $query->id." matched
>>> with $elem listed in $ARGV
[1]: skipped!\n";
>>>                        next;
>>> 		}
>>> elsif ($elem ne 
$query->id) {
>>>       			next if $seen2{ $query->id }++;
>>> 			
>>> 
$newSeqFileName->write_seq($query);
>>>
>>>            	}
>>>
>>>        }
>>> 
}
>>>
>>>
>>> ...
>>> in this way I get only an exact copy of the input file....
where am  
>>> I wrong?
>>>
>>> Thanks a lot for your kind help!
>>> Stefano

>>> _______________________________________________
>>> Bioperl-l mailing list

>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.
org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral 
Researcher
>> Lab of Dr. Marie-Claude Hofmann
>> College of Veterinary Medicine

>> University of Illinois Urbana-Champaign
>>
>>
>>
>>
>>
>>
>
>Christopher 
Fields
>Postdoctoral Researcher
>Lab of Dr. Marie-Claude Hofmann
>College of 
Veterinary Medicine
>University of Illinois Urbana-Champaign
>
>
>
>
>