[Bioperl-l] Output a subset of FASTA data from a single large file

Mon Jun 12 15:28:41 UTC 2006

Michael,

I don't think you can call perl's `print` on just a filehandle as you
are doing.  This is probably your problem.

If you call `select OUT` after opeining it, print will print $_ to it.
And, every line in the fasta record whose header matches on of the IDS
will get printed, not just the fasta header lines.  Read the code again
nothing that $idmatch is only getting reset when a correctly formatted
fasta header line is matched.

--Malcolm

>-----Original Message-----
>From: Chris Fields [mailto:cjfields at uiuc.edu] 
>Sent: Saturday, June 10, 2006 11:32 PM
>To: Michael Oldham
>Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a 
>single large file
>
>What happens if you just print $idmatch or $1 (i.e. check to see if  
>the regex matches anything)?  If there is nothing printed then either  
>the regex isn't working as expected or there is something logically  
>wrong.  The problem may be that the captured string must match the id  
>exactly, the id being the key to the %ID hash; any extra characters  
>picked up by the regex outside of your id key and you will not get  
>anything.  Looking at Malcolm's regex it should work just fine, but  
>we only had one example sequence to try here.
>
>If your while loop is set up like this won't it only print only the  
>matched description lines to the outfile (no sequence) even if there  
>is a match?  Or is this what you wanted?   If you want the sequence  
>you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
>
>Chris
>
>On Jun 9, 2006, at 8:39 PM, Michael Oldham wrote:
>
>> Thanks to everyone for their helpful advice.  I think I am getting  
>> closer,
>> but no cigar quite yet.  The script below runs quickly with no  
>> errors--but
>> the output file is empty.  It seems that the problem must lie  
>> somewhere in
>> the 'while' loop, and I'm sure it's quite obvious to a more  
>> experienced
>> eye--but not to mine!  Any suggestions?  Thanks again for your help.
>>
>> --Mike O.
>>
>>
>> #!/usr/bin/perl -w
>>
>> use strict;
>>
>> my $IDs = 'ID.dat.txt';
>>
>> unless (open(IDFILE, $IDs)) {
>> 	print "Could not open file $IDs!\n";
>> 	}
>>
>> my $probes = 'HG_U95Av2_probe_fasta.txt';
>>
>> unless (open(PROBES, $probes)) {
>> 	print "Could not open file $probes!\n";
>> 	}
>>
>> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>>
>> my @ID = <IDFILE>;
>> chomp @ID;
>> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with  
>> keys=PSIDs and
>> all values=1.
>>
>> 	while (<PROBES>) {
>> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
>> 		if ($idmatch){
>> 			print OUT;
>> 		}
>> 	}
>> exit;
>>
>>
>> -----Original Message-----
>> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
>> Sent: Friday, June 09, 2006 7:58 AM
>> To: Michael Oldham; bioperl-l at lists.open-bio.org
>> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a  
>> single large
>> file
>>
>>
>>
>> I wouldn't bioperl for this, or create an index.  Perl would do  
>> fine and
>> probably be faster.
>>
>> Assuming your ids are one per line in a file named id.dat 
>looking like
>> this
>>
>> 1138_at
>> 1134_at
>> etc..
>>
>> this should work:
>>
>> perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
>> <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;}  $inmatch =
>> exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
>> mybigfile.fa
>>
>> good luck
>>
>> --Malcolm Cook
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org
>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>>> Michael Oldham
>>> Sent: Thursday, June 08, 2006 9:08 PM
>>> To: bioperl-l at lists.open-bio.org
>>> Subject: [Bioperl-l] Output a subset of FASTA data from a
>>> single large file
>>>
>>> Dear all,
>>>
>>> I am a total Bioperl newbie struggling to accomplish a
>>> conceptually simple
>>> task.  I have a single large fasta file containing about 200,000  
>>> probe
>>> sequences (from an Affymetrix microarray), each of which looks
>>> like this:
>>>
>>>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>>> Antisense;
>>> TGGCTCCTGCTGAGGTCCCCTTTCC
>>>
>>> What I would like to do is extract from this file a subset of  
>>> ~130,800
>>> probes (both the header and the sequence) and output this
>>> subset into a new
>>> fasta file.  These 130,800 probes correspond to 8,175 probe set IDs
>>> ("1138_at" is the probe set ID in the header listed above); I
>>> have these
>>> 8,175 IDs listed in a separate file.  I *think* that I managed
>>> to create an
>>> index of all 200,000 probes in the original fasta file using
>>> the following
>>> script:
>>>
>>> #!/usr/bin/perl -w
>>>
>>> # script 1: create the index
>>>
>>> use Bio::Index::Fasta;
>>> use strict;
>>> my $Index_File_Name = shift;
>>> my $inx = Bio::Index::Fasta->new(
>>>     -filename => $Index_File_Name,
>>>     -write_flag => 1);
>>> $inx->make_index(@ARGV);
>>>
>>> I'm not sure if this is the most sensible approach, and even
>>> if it is, I'm
>>> not sure what to do next.  Any help would be greatly appreciated!
>>>
>>> Many thanks,
>>> Mike O.
>>>
>>>
>>>
>>>
>>> --
>>> No virus found in this outgoing message.
>>> Checked by AVG Free Edition.
>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:  
>>> 6/8/2006
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>> --
>> No virus found in this incoming message.
>> Checked by AVG Free Edition.
>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:  
>> 6/8/2006
>>
>> --
>> No virus found in this outgoing message.
>> Checked by AVG Free Edition.
>> Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date:  
>> 6/9/2006
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>Christopher Fields
>Postdoctoral Researcher
>Lab of Dr. Robert Switzer
>Dept of Biochemistry
>University of Illinois Urbana-Champaign
>
>
>
>