[Bioperl-l] Output a subset of FASTA data from a single largefile

Fri Jun 23 16:18:39 UTC 2006

Hello again,

I finally got it to work, using the following script.  However, it takes
about 5 hours to run on a fast computer.  Using grep (in bash), on the
other hand, takes about 5 minutes (see below if you are interested).
Thanks to everyone for your help!

SLOW perl script:

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID_all_X';

unless (open(IDFILE, $IDs)) {
	print "Could not open file $IDs!\n";
	}

my $probes = 'HG_U95Av2_probe_fasta';

unless (open(PROBES, $probes)) {
	print "Could not open file $probes!\n";
	}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
print @ID;
chomp @ID;

while (my $line = <PROBES>) {
	foreach my $identifier (@ID) {
		if($line=~/^>probe:\w+:$identifier:/) {
				print OUT $line;
				print OUT scalar(<PROBES>);
		}
	}
}
exit;

FAST bash script:

#!/usr/bin/bash
exec<"ID_all_X"
while read line; do
	echo $line
	grep -A 1 :$line: HG_U95Av2_probe_fasta >>myresults.txt
done

-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Wednesday, June 14, 2006 6:48 AM
To: Michael Oldham; Chris Fields
Cc: bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single
largefile

Did you try my one-liner?

Anyway, try this
	1) predeclare $idmatch before the while loop
	2) use ` select OUT` and print with no args to get $_ into it

like this:

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID_dat.txt';

unless (open(IDFILE, $IDs)) {
	print "Could not open file $IDs!\n";
	}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
	print "Could not open file $probes!\n";
	}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
select OUT;

my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs
and all values=1.
my $idmatch;
	while (<PROBES>) {
		$idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
		if ($idmatch){
			print ;
		}
	}
exit;

>-----Original Message-----
>From: Michael Oldham [mailto:oldham at ucla.edu]
>Sent: Tuesday, June 13, 2006 9:03 PM
>To: Cook, Malcolm; Chris Fields
>Cc: bioperl-l at lists.open-bio.org
>Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
>single largefile
>
>Dear Malcolm, Chris, et al,
>
>Thanks to everyone for your helpful suggestions.  When I run the code
>below using an ID list ('ID_dat.txt') containing all 8175 IDs, the
>output file is still blank.  If I replace this list with a single ID
>("542_at"), it works:
>
>>probe:HG_U95Av2:542_at:628:567; Interrogation_Position=116; Antisense;
>GCGCAGCAGCGAGAATTTCGACGAG
>>probe:HG_U95Av2:542_at:610:383; Interrogation_Position=128; Antisense;
>GAATTTCGACGAGCTGCTGAAGGCA
>>probe:HG_U95Av2:542_at:289:599; Interrogation_Position=134; Antisense;
>CGACGAGCTGCTGAAGGCACTGGGT
>........etc.
>
>If I try a list of two IDs ("542_at" and "31799_at"), only the last one
>is present in the output:
>
>>probe:HG_U95Av2:31799_at:181:1; Interrogation_Position=1086;
>Antisense;
>GTTCATCACAAATCTATTGTGCTTG
>>probe:HG_U95Av2:31799_at:534:511; Interrogation_Position=1126;
>Antisense;
>GTCCACTAAATGTAGTAACGAAATG
>>probe:HG_U95Av2:31799_at:226:183; Interrogation_Position=1127;
>Antisense;
>TCCACTAAATGTAGTAACGAAATGT
>........etc.
>
>The same thing seems to happen if I go to 3 IDs, or 4 IDs
>(only the last
>ID is present in the output file).  At this point I have no idea why
>this is happening, and I am not sure how to interpret
>Malcolm's comment:
>
>oops,
>
>s/matches on of/matches one of/
>s/nothing that/noting that/
>
>Any ideas?  Thanks again................!
>
>Mike O.
>
>
>#!/usr/bin/perl -w
>
>use strict;
>
>my $IDs = 'ID_dat.txt';
>
>unless (open(IDFILE, $IDs)) {
>	print "Could not open file $IDs!\n";
>	}
>
>my $probes = 'HG_U95Av2_probe_fasta.txt';
>
>unless (open(PROBES, $probes)) {
>	print "Could not open file $probes!\n";
>	}
>
>open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>
>my @ID = <IDFILE>;
>chomp @ID;
>my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs
>and all values=1.
>
>	while (<PROBES>) {
>		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
>		if ($idmatch){
>			print OUT;
>			print OUT scalar(<PROBES>);
>		}
>	}
>exit;
>
>
>-----Original Message-----
>From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
>Sent: Monday, June 12, 2006 8:48 AM
>To: Cook, Malcolm; Chris Fields; Michael Oldham
>Cc: bioperl-l at lists.open-bio.org
>Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single
>largefile
>
>
>oops,
>
>s/matches on of/matches one of/
>s/nothing that/noting that/
>
>--Malcolm
>
>
>>-----Original Message-----
>>From: bioperl-l-bounces at lists.open-bio.org
>>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>>Cook, Malcolm
>>Sent: Monday, June 12, 2006 10:29 AM
>>To: Chris Fields; Michael Oldham
>>Cc: bioperl-l at lists.open-bio.org
>>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
>>single largefile
>>
>>Michael,
>>
>>I don't think you can call perl's `print` on just a filehandle as you
>>are doing.  This is probably your problem.
>>
>>If you call `select OUT` after opeining it, print will print $_ to it.
>>And, every line in the fasta record whose header matches on of the IDS
>>will get printed, not just the fasta header lines.  Read the
>code again
>>nothing that $idmatch is only getting reset when a correctly formatted
>>fasta header line is matched.
>>
>>--Malcolm
>>
>>
>>>-----Original Message-----
>>>From: Chris Fields [mailto:cjfields at uiuc.edu]
>>>Sent: Saturday, June 10, 2006 11:32 PM
>>>To: Michael Oldham
>>>Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
>>>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
>>>single large file
>>>
>>>What happens if you just print $idmatch or $1 (i.e. check to see if
>>>the regex matches anything)?  If there is nothing printed
>>then either
>>>the regex isn't working as expected or there is something logically
>>>wrong.  The problem may be that the captured string must
>>match the id
>>>exactly, the id being the key to the %ID hash; any extra characters
>>>picked up by the regex outside of your id key and you will not get
>>>anything.  Looking at Malcolm's regex it should work just fine, but
>>>we only had one example sequence to try here.
>>>
>>>If your while loop is set up like this won't it only print only the
>>>matched description lines to the outfile (no sequence) even if there
>>>is a match?  Or is this what you wanted?   If you want the sequence
>>>you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
>>>
>>>Chris
>>>
>>>On Jun 9, 2006, at 8:39 PM, Michael Oldham wrote:
>>>
>>>> Thanks to everyone for their helpful advice.  I think I am getting
>>>> closer,
>>>> but no cigar quite yet.  The script below runs quickly with no
>>>> errors--but
>>>> the output file is empty.  It seems that the problem must lie
>>>> somewhere in
>>>> the 'while' loop, and I'm sure it's quite obvious to a more
>>>> experienced
>>>> eye--but not to mine!  Any suggestions?  Thanks again for
>your help.
>>>>
>>>> --Mike O.
>>>>
>>>>
>>>> #!/usr/bin/perl -w
>>>>
>>>> use strict;
>>>>
>>>> my $IDs = 'ID.dat.txt';
>>>>
>>>> unless (open(IDFILE, $IDs)) {
>>>> 	print "Could not open file $IDs!\n";
>>>> 	}
>>>>
>>>> my $probes = 'HG_U95Av2_probe_fasta.txt';
>>>>
>>>> unless (open(PROBES, $probes)) {
>>>> 	print "Could not open file $probes!\n";
>>>> 	}
>>>>
>>>> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>>>>
>>>> my @ID = <IDFILE>;
>>>> chomp @ID;
>>>> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with
>>>> keys=PSIDs and
>>>> all values=1.
>>>>
>>>> 	while (<PROBES>) {
>>>> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
>>>> 		if ($idmatch){
>>>> 			print OUT;
>>>> 		}
>>>> 	}
>>>> exit;
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
>>>> Sent: Friday, June 09, 2006 7:58 AM
>>>> To: Michael Oldham; bioperl-l at lists.open-bio.org
>>>> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
>>>> single large
>>>> file
>>>>
>>>>
>>>>
>>>> I wouldn't bioperl for this, or create an index.  Perl would do
>>>> fine and
>>>> probably be faster.
>>>>
>>>> Assuming your ids are one per line in a file named id.dat
>>>looking like
>>>> this
>>>>
>>>> 1138_at
>>>> 1134_at
>>>> etc..
>>>>
>>>> this should work:
>>>>
>>>> perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
>>>> <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;}  $inmatch =
>>>> exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
>>>> mybigfile.fa
>>>>
>>>> good luck
>>>>
>>>> --Malcolm Cook
>>>>
>>>>> -----Original Message-----
>>>>> From: bioperl-l-bounces at lists.open-bio.org
>>>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>>>>> Michael Oldham
>>>>> Sent: Thursday, June 08, 2006 9:08 PM
>>>>> To: bioperl-l at lists.open-bio.org
>>>>> Subject: [Bioperl-l] Output a subset of FASTA data from a
>>>>> single large file
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I am a total Bioperl newbie struggling to accomplish a
>>>>> conceptually simple
>>>>> task.  I have a single large fasta file containing about 200,000
>>>>> probe
>>>>> sequences (from an Affymetrix microarray), each of which looks
>>>>> like this:
>>>>>
>>>>>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>>>>> Antisense;
>>>>> TGGCTCCTGCTGAGGTCCCCTTTCC
>>>>>
>>>>> What I would like to do is extract from this file a subset of
>>>>> ~130,800
>>>>> probes (both the header and the sequence) and output this
>>>>> subset into a new
>>>>> fasta file.  These 130,800 probes correspond to 8,175
>probe set IDs
>>>>> ("1138_at" is the probe set ID in the header listed above); I
>>>>> have these
>>>>> 8,175 IDs listed in a separate file.  I *think* that I managed
>>>>> to create an
>>>>> index of all 200,000 probes in the original fasta file using
>>>>> the following
>>>>> script:
>>>>>
>>>>> #!/usr/bin/perl -w
>>>>>
>>>>> # script 1: create the index
>>>>>
>>>>> use Bio::Index::Fasta;
>>>>> use strict;
>>>>> my $Index_File_Name = shift;
>>>>> my $inx = Bio::Index::Fasta->new(
>>>>>     -filename => $Index_File_Name,
>>>>>     -write_flag => 1);
>>>>> $inx->make_index(@ARGV);
>>>>>
>>>>> I'm not sure if this is the most sensible approach, and even
>>>>> if it is, I'm
>>>>> not sure what to do next.  Any help would be greatly appreciated!
>>>>>
>>>>> Many thanks,
>>>>> Mike O.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> No virus found in this outgoing message.
>>>>> Checked by AVG Free Edition.
>>>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:
>>>>> 6/8/2006
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>> --
>>>> No virus found in this incoming message.
>>>> Checked by AVG Free Edition.
>>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:
>>>> 6/8/2006
>>>>
>>>> --
>>>> No virus found in this outgoing message.
>>>> Checked by AVG Free Edition.
>>>> Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date:
>>>> 6/9/2006
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>Christopher Fields
>>>Postdoctoral Researcher
>>>Lab of Dr. Robert Switzer
>>>Dept of Biochemistry
>>>University of Illinois Urbana-Champaign
>>>
>>>
>>>
>>>
>>
>>_______________________________________________
>>Bioperl-l mailing list
>>Bioperl-l at lists.open-bio.org
>>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>--
>No virus found in this incoming message.
>Checked by AVG Free Edition.
>Version: 7.1.394 / Virus Database: 268.8.3/361 - Release Date:
>6/11/2006
>
>--
>No virus found in this outgoing message.
>Checked by AVG Free Edition.
>Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date:
>6/13/2006
>
>
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.9.2/373 - Release Date: 6/22/2006