Indexing Refseq

Mon Oct 21 15:24:39 UTC 2002

> -----Original Message-----
> From: simon andrews (BI) [mailto:simon.andrews at bbsrc.ac.uk]
> Subject: Indexing Refseq
> 
> 
> I'm having all sorts of problems working with the latest 
> release of RefSeq
>
> This means that when I run dbiflat (even using -idformat 
> REFSEQ) I get a load of warnings about duplicate entries and 
> when I later try to use the database I find that a load of 
> entries are inaccessible because of this.
> 
> For example, accessions NM_134265, NM_134264 and NM_015626 all 
> have the ID WSB1.

Just to follow up to myself: I've found a temporary workaround for this problem.  The Bioperl script at the bottom of this message pre-processes the current RefSeq files into a format which dbiflat can then index without errors.  You will see a warning from the NC_xxxx chromosome files in RefSeq, but as these contain only features with no sequence I wasn't too worried about them, and the script simply skips them.

Usage of the script is "script_name infile > outfile".  The input file name is required, and the converted entries are written to standard output.

	TTFN

	Simon.
-------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# This script is a filter through which we can
# pass the whole of RefSeq. Newer releases of
# RefSeq replaced the locus ID with a string
# which isn't the accession number and isn't
# unique. This sets each ID back to the
# entry's accession number.

my ($filename) = @ARGV;

die "Usage: $0 infile > outfile\n" unless ($filename);

# Read GenBank-format entries from the named file
my $in = Bio::SeqIO->new(-file   => $filename,
                         -format => 'genbank');

die "Couldn't read $filename" unless ($in);

# Write the modified entries to STDOUT, also in GenBank format
my $out = Bio::SeqIO->new(-fh     => \*STDOUT,
                          -format => 'genbank');

die "Couldn't make output pipe" unless ($out);

while (my $seq = $in->next_seq()) {

  # Some NC_xxx entries are in the RefSeq file
  # but don't have any sequence attached. We'll
  # skip those entries...

  next if ($seq->accession_number() =~ /^NC_/);

  # Use the accession number as the ID so that
  # every entry has a unique ID for dbiflat

  $seq->display_id($seq->accession_number());

  $out->write_seq($seq);

}
#-------------------------------------------------------
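
One way to check whether a file still contains shared IDs, before or after running the filter above, is to count the LOCUS names directly. What follows is only a minimal sketch in plain Perl and is not part of EMBOSS or Bioperl. The helper name shared_locus_names and the three stub records are made up for illustration (they reuse the accessions from the quoted message, and real LOCUS lines carry more fields). It assumes every record has a LOCUS line followed by an ACCESSION line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read GenBank-style text from a filehandle and
# return a hash reference mapping each LOCUS name used by more than
# one record to the accessions of the records that share it.
sub shared_locus_names {
    my ($fh) = @_;
    my (%accessions_for, $locus);

    while (my $line = <$fh>) {
        if ($line =~ /^LOCUS\s+(\S+)/) {
            $locus = $1;                    # start of a new record
        }
        elsif (defined $locus && $line =~ /^ACCESSION\s+(\S+)/) {
            # Record the first accession listed for this record
            push @{ $accessions_for{$locus} }, $1;
            undef $locus;
        }
    }

    # Drop the names that belong to a single record only
    for my $name (keys %accessions_for) {
        delete $accessions_for{$name} if @{ $accessions_for{$name} } < 2;
    }
    return \%accessions_for;
}

# Stub records made up for illustration, not real RefSeq entries
my $sample = '';
for my $acc (qw(NM_134265 NM_134264 NM_015626)) {
    $sample .= "LOCUS       WSB1\nACCESSION   $acc\n//\n";
}

open my $fh, '<', \$sample or die "Couldn't open in-memory sample: $!";
my $shared = shared_locus_names($fh);
close $fh;

# prints: WSB1: NM_134265 NM_134264 NM_015626
for my $name (sort keys %$shared) {
    print "$name: @{ $shared->{$name} }\n";
}
```

To run it against a real file, open that file in place of the in-memory sample. A file that has been through the filter should produce no output, since each ID is then an accession number.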