[Bioperl-l] R: Re: R: Re: fasta file parser
ste.ghi at libero.it
ste.ghi at libero.it
Tue Jul 22 15:51:03 UTC 2008
chris, your suggestion sound reasonable, but I still prefer a file parser
solution.
and I'd like to better understand your demo code, 'couse actually
it's not working...can you just put a comment to the instructions you wrote?
thanks.
stefano
>----Messaggio originale----
>Da: cjfields at uiuc.edu
>Data:
22/07/2008 17.01
>A: "ste.ghi at libero.it"<ste.ghi at libero.it>
>Cc: <bioperl-
l at lists.open-bio.org>
>Ogg: Re: R: Re: [Bioperl-l] fasta file parser
>
>
>On
Jul 22, 2008, at 9:17 AM, ste.ghi at libero.it wrote:
>
>> guys, thanks for your
ready replies
>> but now I'm a little confused...
>>
>> the first 'while' was
just to avoid duplicates in the list...
>> does your code create a cleared
list? what should I write where you
>> say "# do whatever else here..."
>>
>>
thanks
>
>Up to you; you can do absolutely nothing if you prefer. The script
is
>simply a demo of what could be done based upon what we assume you want
>using the code you provided.
>
>Sendu's approach is also quite applicable
(remember, this is perl, so
>TIMTOWTDI). For instance, the way I would go
about this myself: start
>by creating a FASTA database (flatfile or
otherwise) of the sequences
>using the ID as the primary key (always a good
practice IMHO). After
>the database is set up, create a lookup table by
mapping the list of
>IDs from the database to a hash, similarly to what Sendu
demonstrated;
>I believe all DB implementations in BioPerl allow you to
retrieve all
>seq IDs in the database as an array.
>
>Using that you would
then delete any matching IDs in the file from the
>lookup hash; the leftovers
(i.e. those NOT found in the file) would be
>used to retrieve the sequences
from the database. Avoids iterating
>over every sequence (unless you count
database creation, of course,
>but again that could be used for other
purposes as well).
>
>chris
>
>>
>> ----Messaggio originale----
>> Da:
cjfields at uiuc.edu
>> Data: 22/07/2008 15.00
>> A: "ste.ghi at libero.it"<ste.
ghi at libero.it>
>> Cc: <bioperl-l at lists.open-bio.org>
>> Ogg: Re: [Bioperl-l]
fasta file parser
>>
>> Use exists() directly on the %seen lookup hash for your
test. Also,
>> use chomp up front when creating %seen to get rid of
newlines.
>>
>> # assuming you have one ID per line...
>> while (my $line =
<LIST>) {
>> chomp $line;
>> $seen{$line}++;
>> # do whatever else
here...
>> }
>>
>> close LIST;
>>
>> my $newSeqFileName = Bio::SeqIO->new(-
file=> ">>INFILE", -
>> format=>'fasta');
>> while (my $query = $SeqFileName-
>next_seq()) {
>> my $id = $query->id;
>> if ( exists( $seen{$id} ) ) {
>> print "$id matched with $seen{$id} listed in $ARGV[1]:
>> skipped!
\n";
>> next;
>> }
>> elsif ( $elem ne $query->id ) {
>>
if ( exists( $seen2{$id}++ ) ) {;
>> print "$id matched with $seen2
{$id}, already found:
>> skipped!\n";
>> next;
>> }
>> $newSeqFileName->write_seq($query);
>> }
>> }
>>
>> chris
>>
>>
On Jul 22, 2008, at 6:28 AM, ste.ghi at libero.it wrote:
>>
>>> Dear all,
>>> I'm
trying to write a script wich, given a file containing a list of
>>> IDs,
parses a big fasta file returning only sequences NOT listed in
>>> the list-
>>> file.
>>>
>>> To do so, I first create an array with the IDs to be
excluded:
>>>
>>> [...]
>>>
>>> #Load LIST content in an array; avoids
duplicates
>>> while (my $line = <LIST>) {
>>>
>>>
>>> push(@array1,$line );
>>>
>>> foreach my $uniq ( @array1 ){
>>>
>>> next if $seen
>>> { $uniq }
++;
>>>
>>> push @unique, $uniq;
>>>
>>> }
>>> }
>>>
>>> then, process the
fasta file in
>>> this way (NOT WORKING).
>>>
>>> #Fasta file processing
>>> my
$newSeqFileName = Bio::
>>> SeqIO->new(-file=> ">>INFILE", -format=>'fasta');
>>> while (my $query =
>>> $SeqFileName->next_seq()) {
>>> foreach my
$elem(@unique){
>>> chomp $elem;
>>>
>>>
>>> if ($elem eq $query->id)
{
>>>
>>> print $query->id." matched
>>> with $elem listed in $ARGV
[1]: skipped!\n";
>>> next;
>>> }
>>> elsif ($elem ne
$query->id) {
>>> next if $seen2{ $query->id }++;
>>>
>>>
$newSeqFileName->write_seq($query);
>>>
>>> }
>>>
>>> }
>>>
}
>>>
>>>
>>> ...
>>> in this way I get only an exact copy of the input file....
where am
>>> I wrong?
>>>
>>> Thanks a lot for your kind help!
>>> Stefano
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.
org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral
Researcher
>> Lab of Dr. Marie-Claude Hofmann
>> College of Veterinary Medicine
>> University of Illinois Urbana-Champaign
>>
>>
>>
>>
>>
>>
>
>Christopher
Fields
>Postdoctoral Researcher
>Lab of Dr. Marie-Claude Hofmann
>College of
Veterinary Medicine
>University of Illinois Urbana-Champaign
>
>
>
>
>
More information about the Bioperl-l
mailing list