[Bioperl-l] Output a subset of FASTA data from a single largefile

Wed Jun 14 02:03:04 UTC 2006

Dear Malcolm, Chris, et al,

Thanks to everyone for your helpful suggestions.  When I run the code
below using an ID list ('ID_dat.txt') containing all 8175 IDs, the
output file is still blank.  If I replace this list with a single ID
("542_at"), it works:

>probe:HG_U95Av2:542_at:628:567; Interrogation_Position=116; Antisense;
GCGCAGCAGCGAGAATTTCGACGAG
>probe:HG_U95Av2:542_at:610:383; Interrogation_Position=128; Antisense;
GAATTTCGACGAGCTGCTGAAGGCA
>probe:HG_U95Av2:542_at:289:599; Interrogation_Position=134; Antisense;
CGACGAGCTGCTGAAGGCACTGGGT
........etc.

If I try a list of two IDs ("542_at" and "31799_at"), only the last one
is present in the output:

>probe:HG_U95Av2:31799_at:181:1; Interrogation_Position=1086; Antisense;
GTTCATCACAAATCTATTGTGCTTG
>probe:HG_U95Av2:31799_at:534:511; Interrogation_Position=1126;
Antisense;
GTCCACTAAATGTAGTAACGAAATG
>probe:HG_U95Av2:31799_at:226:183; Interrogation_Position=1127;
Antisense;
TCCACTAAATGTAGTAACGAAATGT
........etc.

The same thing seems to happen if I go to 3 IDs, or 4 IDs (only the last
ID is present in the output file).  At this point I have no idea why
this is happening, and I am not sure how to interpret Malcolm's comment:

oops,

s/matches on of/matches one of/
s/nothing that/noting that/

Any ideas?  Thanks again................!

Mike O.

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID_dat.txt';

unless (open(IDFILE, $IDs)) {
	print "Could not open file $IDs!\n";
	}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
	print "Could not open file $probes!\n";
	}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs
and all values=1.

	while (<PROBES>) {
		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
		if ($idmatch){
			print OUT;
			print OUT scalar(<PROBES>);
		}
	}
exit;

-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Monday, June 12, 2006 8:48 AM
To: Cook, Malcolm; Chris Fields; Michael Oldham
Cc: bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single
largefile

oops,

s/matches on of/matches one of/
s/nothing that/noting that/

--Malcolm

>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>Cook, Malcolm
>Sent: Monday, June 12, 2006 10:29 AM
>To: Chris Fields; Michael Oldham
>Cc: bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
>single largefile
>
>Michael,
>
>I don't think you can call perl's `print` on just a filehandle as you
>are doing.  This is probably your problem.
>
>If you call `select OUT` after opeining it, print will print $_ to it.
>And, every line in the fasta record whose header matches on of the IDS
>will get printed, not just the fasta header lines.  Read the code again
>nothing that $idmatch is only getting reset when a correctly formatted
>fasta header line is matched.
>
>--Malcolm
>
>
>>-----Original Message-----
>>From: Chris Fields [mailto:cjfields at uiuc.edu]
>>Sent: Saturday, June 10, 2006 11:32 PM
>>To: Michael Oldham
>>Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
>>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
>>single large file
>>
>>What happens if you just print $idmatch or $1 (i.e. check to see if
>>the regex matches anything)?  If there is nothing printed
>then either
>>the regex isn't working as expected or there is something logically
>>wrong.  The problem may be that the captured string must
>match the id
>>exactly, the id being the key to the %ID hash; any extra characters
>>picked up by the regex outside of your id key and you will not get
>>anything.  Looking at Malcolm's regex it should work just fine, but
>>we only had one example sequence to try here.
>>
>>If your while loop is set up like this won't it only print only the
>>matched description lines to the outfile (no sequence) even if there
>>is a match?  Or is this what you wanted?   If you want the sequence
>>you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
>>
>>Chris
>>
>>On Jun 9, 2006, at 8:39 PM, Michael Oldham wrote:
>>
>>> Thanks to everyone for their helpful advice.  I think I am getting
>>> closer,
>>> but no cigar quite yet.  The script below runs quickly with no
>>> errors--but
>>> the output file is empty.  It seems that the problem must lie
>>> somewhere in
>>> the 'while' loop, and I'm sure it's quite obvious to a more
>>> experienced
>>> eye--but not to mine!  Any suggestions?  Thanks again for your help.
>>>
>>> --Mike O.
>>>
>>>
>>> #!/usr/bin/perl -w
>>>
>>> use strict;
>>>
>>> my $IDs = 'ID.dat.txt';
>>>
>>> unless (open(IDFILE, $IDs)) {
>>> 	print "Could not open file $IDs!\n";
>>> 	}
>>>
>>> my $probes = 'HG_U95Av2_probe_fasta.txt';
>>>
>>> unless (open(PROBES, $probes)) {
>>> 	print "Could not open file $probes!\n";
>>> 	}
>>>
>>> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>>>
>>> my @ID = <IDFILE>;
>>> chomp @ID;
>>> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with
>>> keys=PSIDs and
>>> all values=1.
>>>
>>> 	while (<PROBES>) {
>>> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
>>> 		if ($idmatch){
>>> 			print OUT;
>>> 		}
>>> 	}
>>> exit;
>>>
>>>
>>> -----Original Message-----
>>> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
>>> Sent: Friday, June 09, 2006 7:58 AM
>>> To: Michael Oldham; bioperl-l at lists.open-bio.org
>>> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
>>> single large
>>> file
>>>
>>>
>>>
>>> I wouldn't bioperl for this, or create an index.  Perl would do
>>> fine and
>>> probably be faster.
>>>
>>> Assuming your ids are one per line in a file named id.dat
>>looking like
>>> this
>>>
>>> 1138_at
>>> 1134_at
>>> etc..
>>>
>>> this should work:
>>>
>>> perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
>>> <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;}  $inmatch =
>>> exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
>>> mybigfile.fa
>>>
>>> good luck
>>>
>>> --Malcolm Cook
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org
>>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>>>> Michael Oldham
>>>> Sent: Thursday, June 08, 2006 9:08 PM
>>>> To: bioperl-l at lists.open-bio.org
>>>> Subject: [Bioperl-l] Output a subset of FASTA data from a
>>>> single large file
>>>>
>>>> Dear all,
>>>>
>>>> I am a total Bioperl newbie struggling to accomplish a
>>>> conceptually simple
>>>> task.  I have a single large fasta file containing about 200,000
>>>> probe
>>>> sequences (from an Affymetrix microarray), each of which looks
>>>> like this:
>>>>
>>>>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>>>> Antisense;
>>>> TGGCTCCTGCTGAGGTCCCCTTTCC
>>>>
>>>> What I would like to do is extract from this file a subset of
>>>> ~130,800
>>>> probes (both the header and the sequence) and output this
>>>> subset into a new
>>>> fasta file.  These 130,800 probes correspond to 8,175 probe set IDs
>>>> ("1138_at" is the probe set ID in the header listed above); I
>>>> have these
>>>> 8,175 IDs listed in a separate file.  I *think* that I managed
>>>> to create an
>>>> index of all 200,000 probes in the original fasta file using
>>>> the following
>>>> script:
>>>>
>>>> #!/usr/bin/perl -w
>>>>
>>>> # script 1: create the index
>>>>
>>>> use Bio::Index::Fasta;
>>>> use strict;
>>>> my $Index_File_Name = shift;
>>>> my $inx = Bio::Index::Fasta->new(
>>>>     -filename => $Index_File_Name,
>>>>     -write_flag => 1);
>>>> $inx->make_index(@ARGV);
>>>>
>>>> I'm not sure if this is the most sensible approach, and even
>>>> if it is, I'm
>>>> not sure what to do next.  Any help would be greatly appreciated!
>>>>
>>>> Many thanks,
>>>> Mike O.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> No virus found in this outgoing message.
>>>> Checked by AVG Free Edition.
>>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:
>>>> 6/8/2006
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>> --
>>> No virus found in this incoming message.
>>> Checked by AVG Free Edition.
>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:
>>> 6/8/2006
>>>
>>> --
>>> No virus found in this outgoing message.
>>> Checked by AVG Free Edition.
>>> Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date:
>>> 6/9/2006
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>Christopher Fields
>>Postdoctoral Researcher
>>Lab of Dr. Robert Switzer
>>Dept of Biochemistry
>>University of Illinois Urbana-Champaign
>>
>>
>>
>>
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/361 - Release Date: 6/11/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006