[Bioperl-l] acquiring a local refseq + index

Erik er at xs4all.nl
Sun Dec 31 00:05:16 UTC 2006


Hi all,

I downloaded the refseq files (.gbff) and want to index the lot with
Bio::DB::Flat.

It turns out that there are many cases where the SOURCE and ORGANISM lines
are messed up, sometimes to a degree where the indexing fails on a
Bio::SeqIO::genbank error.
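
To see which records are affected before indexing, a plain-text scan that
does not go through Bio::SeqIO may help. This is only a minimal sketch: it
assumes "messed up" means a record whose ORGANISM line is missing or empty,
and the helper name records_without_organism is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: scan one GenBank flat file as plain text and return
# the LOCUS names of records whose ORGANISM line is missing or empty.
sub records_without_organism {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my (@suspect, $locus, $organism);
    while (my $line = <$fh>) {
        if ($line =~ /^LOCUS\s+(\S+)/) {
            # Start of a new record: remember its name, forget the old organism.
            ($locus, $organism) = ($1, undef);
        }
        elsif ($line =~ /^\s*ORGANISM\s+(\S.*)/) {
            $organism = $1;
        }
        elsif ($line =~ m{^//}) {
            # End of record: flag it if no usable ORGANISM line was seen.
            push @suspect, $locus if defined $locus && !defined $organism;
            ($locus, $organism) = (undef, undef);
        }
    }
    close $fh;
    return @suspect;
}
```

Calling it on each .gbff file and printing the returned LOCUS names gives a
list of records to inspect by hand. Other kinds of damage (for example a
SOURCE line that runs into the ORGANISM line) would need further checks.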

I'd like to change Bio::SeqIO::genbank so that parsing gets at least far
enough to make indexing the refseq files possible, and hopefully also
improves the taxonomic output ($seq->species->binomial is often mutilated
at the moment).

Is it still worthwhile to change parsing modules like Bio::SeqIO::genbank?
Is anyone already working on a rewrite? If so, I may be better off
writing my own indexing scheme.

Below is an outline of my indexing program, which uses Bio::DB::Flat::BDB.
If anyone knows of a better way to get a locally searchable refseq flat
file index, I would be very interested.

Thanks for your help,

Erikjan


-------------
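use strict;
use warnings;
use Bio::DB::Flat;

# Root directory for the index; the .gbff files are read from here as well.
my $refseq_dir = '/data/ftp.ncbi.nih.gov/refseq/release/complete';
my $db = Bio::DB::Flat->new(
   -directory  => $refseq_dir,
   -dbname     => 'refseq',
   -format     => 'genbank',
   -index      => 'bdb',
   -write_flag => 1,
);
# getfiles() is my own helper; it returns the list of .gbff files to index.
my @files = getfiles($refseq_dir);
for my $f (@files) {
        $db->build_index($f);
}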
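
The body of getfiles is not part of the outline. A minimal sketch of what
such a helper could look like, assuming the release files have already been
decompressed to plain *.gbff files sitting directly in the given directory
(no recursion, no .gz handling), using only core Perl:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the sorted full paths of all *.gbff files
# found directly inside $dir.
sub getfiles {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = grep { /\.gbff\z/ } readdir $dh;
    closedir $dh;

    # Build full paths and keep only regular files.
    my @paths = map { File::Spec->catfile($dir, $_) } @names;
    @paths = grep { -f $_ } @paths;
    @paths = sort @paths;
    return @paths;
}
```

Sorting keeps the indexing order reproducible between runs, which makes it
easier to tell at which file a parse error occurred.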




