[Bioperl-l] Output a subset of FASTA data from a single large file
simon andrews (BI)
simon.andrews at bbsrc.ac.uk
Fri Jun 9 15:01:05 UTC 2006
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> Michael Oldham
> Sent: 09 June 2006 03:08
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Output a subset of FASTA data from a
> single large file
>
> Dear all,
>
> I am a total Bioperl newbie struggling to accomplish a
> conceptually simple task. I have a single large fasta file
> containing about 200,000 probe sequences (from an Affymetrix
> microarray), each of which looks like this:
>
> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
> >Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
Unfortunately that's not Fasta format (which only has a single header
line starting with a '>'. I'd imagine that most programs which deal
with fasta which read that entry would see it as two sequences, the
first of which is empty.
> What I would like to do is extract from this file a subset of
> ~130,800 probes (both the header and the sequence) and output
> this subset into a new fasta file. These 130,800 probes
> correspond to 8,175 probe set IDs ("1138_at" is the probe set
> ID in the header listed above)
If you're only having to do this once then it should be fairly quick to
knock up a one off script to do this. Since you've only got 8000ish
probeset ids then you can probably just read those into a hash to start
with then parse through your big sequence file with something like;
#!perl
use warnings;
use strict;
my %probe_ids;
# Add real code here to populate your hash
$probe_ids{1138_at} = 1;
##########################################
open (IN,'your_affy_file.txt') or die "Can't read affy file: $!";
open (OUT,'>','probe_list.txt') or die "Can't write output: $!";
while (<IN>) {
if (/^>probe/) {
# This assumes there are always 3 lines per probe entry
if (exists $probe_ids{(split(/:/))[2]}) {
print OUT;
print OUT scalar <IN>;
print OUT scalar <IN>;
}
}
}
More information about the Bioperl-l
mailing list