[Bioperl-l] Output a subset of FASTA data from a single largefile

Mon Jun 12 15:47:09 UTC 2006

ooops, in my message 


>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org 
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of 
>Cook, Malcolm
>Sent: Monday, June 12, 2006 10:29 AM
>To: Chris Fields; Michael Oldham
>Cc: bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a 
>single largefile
>
>Michael,
>
>I don't think you can call perl's `print` on just a filehandle as you
>are doing.  This is probably your problem.
>
>If you call `select OUT` after opeining it, print will print $_ to it.
>And, every line in the fasta record whose header matches on of the IDS
>will get printed, not just the fasta header lines.  Read the code again
>nothing that $idmatch is only getting reset when a correctly formatted
>fasta header line is matched.
>
>--Malcolm
>
>
>>-----Original Message-----
>>From: Chris Fields [mailto:cjfields at uiuc.edu] 
>>Sent: Saturday, June 10, 2006 11:32 PM
>>To: Michael Oldham
>>Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
>>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a 
>>single large file
>>
>>What happens if you just print $idmatch or $1 (i.e. check to see if  
>>the regex matches anything)?  If there is nothing printed 
>then either  
>>the regex isn't working as expected or there is something logically  
>>wrong.  The problem may be that the captured string must 
>match the id  
>>exactly, the id being the key to the %ID hash; any extra characters  
>>picked up by the regex outside of your id key and you will not get  
>>anything.  Looking at Malcolm's regex it should work just fine, but  
>>we only had one example sequence to try here.
>>
>>If your while loop is set up like this won't it only print only the  
>>matched description lines to the outfile (no sequence) even if there  
>>is a match?  Or is this what you wanted?   If you want the sequence  
>>you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
>>
>>Chris
>>
>>On Jun 9, 2006, at 8:39 PM, Michael Oldham wrote:
>>
>>> Thanks to everyone for their helpful advice.  I think I am getting  
>>> closer,
>>> but no cigar quite yet.  The script below runs quickly with no  
>>> errors--but
>>> the output file is empty.  It seems that the problem must lie  
>>> somewhere in
>>> the 'while' loop, and I'm sure it's quite obvious to a more  
>>> experienced
>>> eye--but not to mine!  Any suggestions?  Thanks again for your help.
>>>
>>> --Mike O.
>>>
>>>
>>> #!/usr/bin/perl -w
>>>
>>> use strict;
>>>
>>> my $IDs = 'ID.dat.txt';
>>>
>>> unless (open(IDFILE, $IDs)) {
>>> 	print "Could not open file $IDs!\n";
>>> 	}
>>>
>>> my $probes = 'HG_U95Av2_probe_fasta.txt';
>>>
>>> unless (open(PROBES, $probes)) {
>>> 	print "Could not open file $probes!\n";
>>> 	}
>>>
>>> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>>>
>>> my @ID = <IDFILE>;
>>> chomp @ID;
>>> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with  
>>> keys=PSIDs and
>>> all values=1.
>>>
>>> 	while (<PROBES>) {
>>> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
>>> 		if ($idmatch){
>>> 			print OUT;
>>> 		}
>>> 	}
>>> exit;
>>>
>>>
>>> -----Original Message-----
>>> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
>>> Sent: Friday, June 09, 2006 7:58 AM
>>> To: Michael Oldham; bioperl-l at lists.open-bio.org
>>> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a  
>>> single large
>>> file
>>>
>>>
>>>
>>> I wouldn't bioperl for this, or create an index.  Perl would do  
>>> fine and
>>> probably be faster.
>>>
>>> Assuming your ids are one per line in a file named id.dat 
>>looking like
>>> this
>>>
>>> 1138_at
>>> 1134_at
>>> etc..
>>>
>>> this should work:
>>>
>>> perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
>>> <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;}  $inmatch =
>>> exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
>>> mybigfile.fa
>>>
>>> good luck
>>>
>>> --Malcolm Cook
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org
>>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>>>> Michael Oldham
>>>> Sent: Thursday, June 08, 2006 9:08 PM
>>>> To: bioperl-l at lists.open-bio.org
>>>> Subject: [Bioperl-l] Output a subset of FASTA data from a
>>>> single large file
>>>>
>>>> Dear all,
>>>>
>>>> I am a total Bioperl newbie struggling to accomplish a
>>>> conceptually simple
>>>> task.  I have a single large fasta file containing about 200,000  
>>>> probe
>>>> sequences (from an Affymetrix microarray), each of which looks
>>>> like this:
>>>>
>>>>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>>>> Antisense;
>>>> TGGCTCCTGCTGAGGTCCCCTTTCC
>>>>
>>>> What I would like to do is extract from this file a subset of  
>>>> ~130,800
>>>> probes (both the header and the sequence) and output this
>>>> subset into a new
>>>> fasta file.  These 130,800 probes correspond to 8,175 probe set IDs
>>>> ("1138_at" is the probe set ID in the header listed above); I
>>>> have these
>>>> 8,175 IDs listed in a separate file.  I *think* that I managed
>>>> to create an
>>>> index of all 200,000 probes in the original fasta file using
>>>> the following
>>>> script:
>>>>
>>>> #!/usr/bin/perl -w
>>>>
>>>> # script 1: create the index
>>>>
>>>> use Bio::Index::Fasta;
>>>> use strict;
>>>> my $Index_File_Name = shift;
>>>> my $inx = Bio::Index::Fasta->new(
>>>>     -filename => $Index_File_Name,
>>>>     -write_flag => 1);
>>>> $inx->make_index(@ARGV);
>>>>
>>>> I'm not sure if this is the most sensible approach, and even
>>>> if it is, I'm
>>>> not sure what to do next.  Any help would be greatly appreciated!
>>>>
>>>> Many thanks,
>>>> Mike O.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> No virus found in this outgoing message.
>>>> Checked by AVG Free Edition.
>>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:  
>>>> 6/8/2006
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>> --
>>> No virus found in this incoming message.
>>> Checked by AVG Free Edition.
>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:  
>>> 6/8/2006
>>>
>>> --
>>> No virus found in this outgoing message.
>>> Checked by AVG Free Edition.
>>> Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date:  
>>> 6/9/2006
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>Christopher Fields
>>Postdoctoral Researcher
>>Lab of Dr. Robert Switzer
>>Dept of Biochemistry
>>University of Illinois Urbana-Champaign
>>
>>
>>
>>
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>