Indexing Refseq
simon andrews (BI)
simon.andrews at bbsrc.ac.uk
Mon Oct 21 15:24:39 UTC 2002
> -----Original Message-----
> From: simon andrews (BI) [mailto:simon.andrews at bbsrc.ac.uk]
> Subject: Indexing Refseq
>
>
> I'm having all sorts of problems working with the latest
> release of RefSeq
>
> This means that when I run dbiflat (even using -idformat
> REFSEQ) I get a load of warnings about duplicate entries and
> when I later try to use the database I find that a load of
> entries are inaccessible because of this.
>
> For example accessions NM_134265, NM_134264 and NM_015626 all
> have the ID WSB1.
Just to follow up to myself: I've found a temporary workaround for this problem. The Bioperl script at the bottom of this message pre-processes the current RefSeq files into a format which dbiflat can then index without errors. You will see a warning for the NC_xxxx chromosome entries in RefSeq, but as these contain only features with no sequence I wasn't too worried about them and the script simply skips them.
Usage of the script is "script_name [infile] > outfile".
TTFN
Simon.
-------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# This script is a filter through which we can
# pass the whole of RefSeq. Newer versions of
# RefSeq replaced their locus ID with a string
# which wasn't the accession number. This
# just changes them back.

my ($filename) = @ARGV;
die "No filename given" unless ($filename);

my $in = Bio::SeqIO->new(-file   => $filename,
                         -format => 'genbank');
die "Couldn't read $filename" unless ($in);

my $out = Bio::SeqIO->new(-fh     => \*STDOUT,
                          -format => 'genbank');
die "Couldn't make output pipe" unless ($out);

while (my $seq = $in->next_seq()) {
    # Some NC_xxx seqs are in the RefSeq file
    # but don't have any sequence attached. We'll
    # skip those entries...
    next if ($seq->accession_number() =~ /^NC/);

    # Use the accession number as the entry ID
    # (accession_number() is the Bio::Seq accessor for it).
    $seq->display_id($seq->accession_number());
    $out->write_seq($seq);
}
#-------------------------------------------------------
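To check whether a file still contains clashing IDs before handing it to dbiflat, something like the following could be used. It is only a minimal sketch in core Perl (no Bioperl needed), and it assumes the clashing IDs are the entry names on the LOCUS lines, as the WSB1 example suggests. The helper name duplicate_locus_names is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): read a
# GenBank-format stream and return, sorted, every LOCUS name that
# occurs more than once.
sub duplicate_locus_names {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        # The entry name is the second whitespace-separated
        # token on each LOCUS line.
        next unless $line =~ /^LOCUS\s+(\S+)/;
        $count{$1}++;
    }
    my @dups = grep { $count{$_} > 1 } keys %count;
    @dups = sort @dups;
    return @dups;
}

# Report the duplicated names for every file named on the command line.
for my $filename (@ARGV) {
    open(my $fh, '<', $filename)
        or die "Couldn't read $filename: $!\n";
    my @dups = duplicate_locus_names($fh);
    close($fh);
    print "$filename\t$_\n" for @dups;
}
```

Run on the original RefSeq file this should list names such as WSB1; run on the filtered output it should print nothing if every entry now carries a unique accession as its ID.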
More information about the EMBOSS mailing list