[Bioperl-l] Output a subset of FASTA data from a singlelargefile
Sean Davis
sdavis2 at mail.nih.gov
Sat Jun 24 14:45:52 UTC 2006
----- Original Message -----
From: "Michael Oldham" <oldham at ucla.edu>
To: "Cook, Malcolm" <MEC at stowers-institute.org>; "Chris Fields"
<cjfields at uiuc.edu>
Cc: <bioperl-l at lists.open-bio.org>
Sent: Friday, June 23, 2006 12:18 PM
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
singlelargefile
> Hello again,
>
> I finally got it to work, using the following script. However, it takes
> about 5 hours to run on a fast computer. Using grep (in bash), on the
> other hand, takes about 5 minutes (see below if you are interested).
> Thanks to everyone for your help!
>
> SLOW perl script:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> my $IDs = 'ID_all_X';
>
> unless (open(IDFILE, $IDs)) {
> print "Could not open file $IDs!\n";
> }
>
> my $probes = 'HG_U95Av2_probe_fasta';
>
> unless (open(PROBES, $probes)) {
> print "Could not open file $probes!\n";
> }
>
> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>
> my @ID = <IDFILE>;
> print @ID;
> chomp @ID;
>
> while (my $line = <PROBES>) {
> foreach my $identifier (@ID) {
> if($line=~/^>probe:\w+:$identifier:/) {
> print OUT $line;
> print OUT scalar(<PROBES>);
> }
> }
> }
This could probably be done MUCH faster using a hash on the sequence
identifier. (I have to admit that I didn't follow the first part of this
conversation, so I could be misunderstanding some part of what you are
trying to do.) If you have a couple hundred-thousand sequences, my guess is
that it could be done in under 30 seconds, but I could be wrong about the
exact time. The important part is to make a hash of your sequences with the
key being the $identifier. Then, loop through your @ID array doing
something like (untested):
#open files as before and read in @ID as before
my %seq_hash;
while (my $line = <PROBES>) {
if ($line =~/^>probe:\w+:$identifier:/) {
$seq_hash{$identifier}=<PROBES>;
}
}
foreach my $id (@ID) {
print OUT ">$id\n" . $seq_hash{$id};
}
More information about the Bioperl-l
mailing list