[Bioperl-l] Output a subset of FASTA data from a single largefile
Cook, Malcolm
MEC at stowers-institute.org
Mon Jun 12 21:15:35 UTC 2006
Yeah, good points...
... my recommendation of the one-liner was motivated based on a small
number of IDs and no other applications needing to index the entire
fasta database.
--Malcolm [At which point he bowed out of this fray]
>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
>Sent: Monday, June 12, 2006 3:35 PM
>To: 'Sendu Bala'; bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
>single largefile
>
>...
>> Chris Fields wrote:
>> > There's information in the HOWTOs:
>> >
>> > http://www.bioperl.org/wiki/HOWTO:Flat_databases
>> >
>> > http://www.bioperl.org/wiki/HOWTO:OBDA
>> >
>...
>> As you later discovered, that was an Outlook problem. Just
>to make this
>> thread relevant to bioperl, the bioperl solution is:
>
>Agreed (stupid Outlook). It might be much faster to use
>non-Bioperl-ish
>ways, but it is easier to further manipulate sequences (convert format,
>analyze sequences, etc) using Bioperl directly. I haven't used flat
>databases much but it should move very quickly, even in an OO
>environment.
>
>The one problem with the proposed non-bioperl method is, if you wanted
>100,000 sequences (based on ID's) in a FASTA database file containing
>200,000 sequences, all ID's would need to be stored (1) in an
>array (which
>gulped the data from the ID file) and then map the ID's to (2) a hash;
>that's may be a pretty big memory footprint depending on your system.
>
>Sendu's BioPerl version indexes the FASTA file based on the
>ID, then (1)
>reads the ID's in one at a time from the file, (2) retrieves
>the data, then
>(3) prints it out. The advantage of this approach is that
>the built index
>can be used in other bioperl scripts as well w/o having to
>rebuild it again,
>so if you wanted a different set of ID's later on you can access the
>database using the prebuilt index. More can be found in the
>Bio::Index::Fasta POD.
>
>You can also use the ideas and code in the HOWTO (Flat Databases) I
>mentioned, which focuses on the Bio::DB::Flat system and ODBA. The
>advantage of these is that you can use Sleepycat's Berkeley
>Database through
>the Perl BerkeleyDB module (more functionality than DB_File)
>which is faster
>than a standard flat database. In the HOWTO, specifically look under
>'Secondary or custom namespaces' for ideas on how to use your ID as a
>primary or secondary key.
>
>Chris
>
>> use Bio::SeqIO;
>> use Bio::Index::Fasta;
>> my $inx = Bio::Index::Fasta->new(-write_flag => 1);
>> $inx->id_parser(\&get_id);
>> $inx->make_index(shift);
>>
>> my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
>> my $wanted_ids_file = shift;
>> open(IDS, $wanted_ids_file);
>> while (<IDS>) {
>> chomp;
>> my $seq = $inx->fetch($_);
>> $out->write_seq($seq);
>> }
>>
>> sub get_id {
>> my $line = shift;
>> $line =~ /^>probe:\S+?:(\S+?):/;
>> $1;
>> }
>>
>> It works for me on the sample sequence given by the OP.
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
More information about the Bioperl-l
mailing list