[Bioperl-l] Output a subset of FASTA data from a single largefile

Chris Fields cjfields at uiuc.edu
Mon Jun 12 19:13:30 UTC 2006


Michael, Malcolm et al,

I ran Michael's code (not Malcolm's one-liner), with and w/o adding the file
handle line that I suggested.  My suggestion works b/c I'm calling the file
handle in scalar context, which reads the next line, just like '$foo =
<FILE>' or 'while(<FILE>) {}' advances to the next line (with $/ = "\n")
each time the file handle is called.  You could use:

$_ = <PROBES>;
print OUT;

I just chopped it down to one line.

Without the extra line I suggested I get only the description line (I used
this as a test file based on the original sequence and Michael's description
of the ID):

>probe:HG_U95Av2:1138_at:395:301;Interrogation_Position=2631;Antisense;
>probe:HG_U95Av2:1267_at:395:301;Interrogation_Position=2631;Antisense;
>probe:HG_U95Av2:1500_at:395:301;Interrogation_Position=2631;Antisense;
>probe:HG_U95Av2:2037_at:395:301;Interrogation_Position=2631;Antisense;
>probe:HG_U95Av2:3289390128_at:395:301;Interrogation_Position=2631;Antisense
;

Which I don't think Michael wants (he mentioned sequence and description, I
think).  

Modifying the loop in Michael's code to:
...

while (<PROBES>) {
	my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
	if ($idmatch){
		print OUT;
		print OUT <PROBES>; # grabs next line and prints
	}
}

Gets:

>probe:HG_U95Av2:1138_at:395:301;Interrogation_Position=2631;Antisense;
AGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:1267_at:395:301;Interrogation_Position=2631;Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:1500_at:395:301;Interrogation_Position=2631;Antisense;
TGGCTCCTGCTGAGGTCCCCTATCC
>probe:HG_U95Av2:2037_at:395:301;Interrogation_Position=2631;Antisense;
TGGATCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:3289390128_at:395:301;Interrogation_Position=2631;Antisense
;
TGGCTACTGCTGAGGTCCCCTTTCC

Which matches the ID's in the ID file (there are 10 sequences in the probes
file).  

I did notice one odd thing; I tried the above code on Mac OS X and it worked
fine (i.e. printed only the descriptions and sequences for the ID's in the
ID hash).  If I used Windows, I needed to use this version:

while (<PROBES>) {
	my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
	if ($idmatch){
		print OUT;
		print OUT scalar(<PROBES>);		
	}
}

Or 'print <PROBES>;' prints all sequences (I guess it assumes list context
instead of scalar context when printing, so this forces it to be scalar).

Like I said, I haven't tried Malcolm's one-liner.  It's possible that it
works just as well as what I suggested.  I'm just responding to Michael's
code request.

Chris



> -----Original Message-----
> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
> Sent: Monday, June 12, 2006 10:48 AM
> To: Cook, Malcolm; Chris Fields; Michael Oldham
> Cc: bioperl-l at lists.open-bio.org
> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single
> largefile
> 
> oops,
> 
> s/matches on of/matches one of/
> s/nothing that/noting that/
> 
> --Malcolm
> 
> 
> >-----Original Message-----
> >From: bioperl-l-bounces at lists.open-bio.org
> >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> >Cook, Malcolm
> >Sent: Monday, June 12, 2006 10:29 AM
> >To: Chris Fields; Michael Oldham
> >Cc: bioperl-l at lists.open-bio.org
> >Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
> >single largefile
> >
> >Michael,
> >
> >I don't think you can call perl's `print` on just a filehandle as you
> >are doing.  This is probably your problem.
> >
> >If you call `select OUT` after opeining it, print will print $_ to it.
> >And, every line in the fasta record whose header matches on of the IDS
> >will get printed, not just the fasta header lines.  Read the code again
> >nothing that $idmatch is only getting reset when a correctly formatted
> >fasta header line is matched.
> >
> >--Malcolm
> >
> >
> >>-----Original Message-----
> >>From: Chris Fields [mailto:cjfields at uiuc.edu]
> >>Sent: Saturday, June 10, 2006 11:32 PM
> >>To: Michael Oldham
> >>Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
> >>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
> >>single large file
> >>
> >>What happens if you just print $idmatch or $1 (i.e. check to see if
> >>the regex matches anything)?  If there is nothing printed
> >then either
> >>the regex isn't working as expected or there is something logically
> >>wrong.  The problem may be that the captured string must
> >match the id
> >>exactly, the id being the key to the %ID hash; any extra characters
> >>picked up by the regex outside of your id key and you will not get
> >>anything.  Looking at Malcolm's regex it should work just fine, but
> >>we only had one example sequence to try here.
> >>
> >>If your while loop is set up like this won't it only print only the
> >>matched description lines to the outfile (no sequence) even if there
> >>is a match?  Or is this what you wanted?   If you want the sequence
> >>you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
> >>
> >>Chris
> >>
> >>On Jun 9, 2006, at 8:39 PM, Michael Oldham wrote:
> >>
> >>> Thanks to everyone for their helpful advice.  I think I am getting
> >>> closer,
> >>> but no cigar quite yet.  The script below runs quickly with no
> >>> errors--but
> >>> the output file is empty.  It seems that the problem must lie
> >>> somewhere in
> >>> the 'while' loop, and I'm sure it's quite obvious to a more
> >>> experienced
> >>> eye--but not to mine!  Any suggestions?  Thanks again for your help.
> >>>
> >>> --Mike O.
> >>>
> >>>
> >>> #!/usr/bin/perl -w
> >>>
> >>> use strict;
> >>>
> >>> my $IDs = 'ID.dat.txt';
> >>>
> >>> unless (open(IDFILE, $IDs)) {
> >>> 	print "Could not open file $IDs!\n";
> >>> 	}
> >>>
> >>> my $probes = 'HG_U95Av2_probe_fasta.txt';
> >>>
> >>> unless (open(PROBES, $probes)) {
> >>> 	print "Could not open file $probes!\n";
> >>> 	}
> >>>
> >>> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
> >>>
> >>> my @ID = <IDFILE>;
> >>> chomp @ID;
> >>> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with
> >>> keys=PSIDs and
> >>> all values=1.
> >>>
> >>> 	while (<PROBES>) {
> >>> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
> >>> 		if ($idmatch){
> >>> 			print OUT;
> >>> 		}
> >>> 	}
> >>> exit;
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
> >>> Sent: Friday, June 09, 2006 7:58 AM
> >>> To: Michael Oldham; bioperl-l at lists.open-bio.org
> >>> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
> >>> single large
> >>> file
> >>>
> >>>
> >>>
> >>> I wouldn't bioperl for this, or create an index.  Perl would do
> >>> fine and
> >>> probably be faster.
> >>>
> >>> Assuming your ids are one per line in a file named id.dat
> >>looking like
> >>> this
> >>>
> >>> 1138_at
> >>> 1134_at
> >>> etc..
> >>>
> >>> this should work:
> >>>
> >>> perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
> >>> <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;}  $inmatch =
> >>> exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
> >>> mybigfile.fa
> >>>
> >>> good luck
> >>>
> >>> --Malcolm Cook
> >>>
> >>>> -----Original Message-----
> >>>> From: bioperl-l-bounces at lists.open-bio.org
> >>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> >>>> Michael Oldham
> >>>> Sent: Thursday, June 08, 2006 9:08 PM
> >>>> To: bioperl-l at lists.open-bio.org
> >>>> Subject: [Bioperl-l] Output a subset of FASTA data from a
> >>>> single large file
> >>>>
> >>>> Dear all,
> >>>>
> >>>> I am a total Bioperl newbie struggling to accomplish a
> >>>> conceptually simple
> >>>> task.  I have a single large fasta file containing about 200,000
> >>>> probe
> >>>> sequences (from an Affymetrix microarray), each of which looks
> >>>> like this:
> >>>>
> >>>>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
> >>>> Antisense;
> >>>> TGGCTCCTGCTGAGGTCCCCTTTCC
> >>>>
> >>>> What I would like to do is extract from this file a subset of
> >>>> ~130,800
> >>>> probes (both the header and the sequence) and output this
> >>>> subset into a new
> >>>> fasta file.  These 130,800 probes correspond to 8,175 probe set IDs
> >>>> ("1138_at" is the probe set ID in the header listed above); I
> >>>> have these
> >>>> 8,175 IDs listed in a separate file.  I *think* that I managed
> >>>> to create an
> >>>> index of all 200,000 probes in the original fasta file using
> >>>> the following
> >>>> script:
> >>>>
> >>>> #!/usr/bin/perl -w
> >>>>
> >>>> # script 1: create the index
> >>>>
> >>>> use Bio::Index::Fasta;
> >>>> use strict;
> >>>> my $Index_File_Name = shift;
> >>>> my $inx = Bio::Index::Fasta->new(
> >>>>     -filename => $Index_File_Name,
> >>>>     -write_flag => 1);
> >>>> $inx->make_index(@ARGV);
> >>>>
> >>>> I'm not sure if this is the most sensible approach, and even
> >>>> if it is, I'm
> >>>> not sure what to do next.  Any help would be greatly appreciated!
> >>>>
> >>>> Many thanks,
> >>>> Mike O.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> No virus found in this outgoing message.
> >>>> Checked by AVG Free Edition.
> >>>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:
> >>>> 6/8/2006
> >>>>
> >>>> _______________________________________________
> >>>> Bioperl-l mailing list
> >>>> Bioperl-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>> --
> >>> No virus found in this incoming message.
> >>> Checked by AVG Free Edition.
> >>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:
> >>> 6/8/2006
> >>>
> >>> --
> >>> No virus found in this outgoing message.
> >>> Checked by AVG Free Edition.
> >>> Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date:
> >>> 6/9/2006
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>Christopher Fields
> >>Postdoctoral Researcher
> >>Lab of Dr. Robert Switzer
> >>Dept of Biochemistry
> >>University of Illinois Urbana-Champaign
> >>
> >>
> >>
> >>
> >
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l at lists.open-bio.org
> >http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >




More information about the Bioperl-l mailing list