[Bioperl-l] Output a subset of FASTA data from a single large file
Michael Oldham
oldham at ucla.edu
Sat Jun 10 01:39:45 UTC 2006
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single large
file
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat looking like
this
1138_at
1134_at
etc..
this should work:
perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>Michael Oldham
>Sent: Thursday, June 08, 2006 9:08 PM
>To: bioperl-l at lists.open-bio.org
>Subject: [Bioperl-l] Output a subset of FASTA data from a
>single large file
>
>Dear all,
>
>I am a total Bioperl newbie struggling to accomplish a
>conceptually simple
>task. I have a single large fasta file containing about 200,000 probe
>sequences (from an Affymetrix microarray), each of which looks
>like this:
>
>>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>Antisense;
>TGGCTCCTGCTGAGGTCCCCTTTCC
>
>What I would like to do is extract from this file a subset of ~130,800
>probes (both the header and the sequence) and output this
>subset into a new
>fasta file. These 130,800 probes correspond to 8,175 probe set IDs
>("1138_at" is the probe set ID in the header listed above); I
>have these
>8,175 IDs listed in a separate file. I *think* that I managed
>to create an
>index of all 200,000 probes in the original fasta file using
>the following
>script:
>
>#!/usr/bin/perl -w
>
> # script 1: create the index
>
> use Bio::Index::Fasta;
> use strict;
> my $Index_File_Name = shift;
> my $inx = Bio::Index::Fasta->new(
> -filename => $Index_File_Name,
> -write_flag => 1);
> $inx->make_index(@ARGV);
>
>I'm not sure if this is the most sensible approach, and even
>if it is, I'm
>not sure what to do next. Any help would be greatly appreciated!
>
>Many thanks,
>Mike O.
>
>
>
>
>--
>No virus found in this outgoing message.
>Checked by AVG Free Edition.
>Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date: 6/9/2006
More information about the Bioperl-l
mailing list