[Bioperl-l] Retrieve FASTA seqs with NCBI definition line

Jason Stajich jason at cgt.mc.duke.edu
Mon Apr 7 11:13:38 EDT 2003


It is not really going to work perfectly in the sense that we
don't currently parse the DBSOURCE fields from genbank files.
So we won't be able to properly set the dbsource 'ref|gb|emb|dbj|pdb...'
field.

If you're willing to forgive that, then this should work.

If you really just want it right - download genpept & use Bio::DB::Flat to
index it - display_id will already be in NCBI way.

[side note - not going to help you right now]

Alternatively you can request the return format to be 'fasta' from GenPept
which will have the display id already set properly for you.  Or at least
you used to be able to - there seems to be some code preventing this from
happening and I'm not sure why.  You need to comment out the
request_format method in GenPept.pm and then add a
$db->request_format('fasta') to the code below.

[end note]

#!/usr/bin/perl -w
use strict;
use Bio::DB::GenPept;
use Bio::SeqIO;

my $default_src = 'gb';

my $db = new Bio::DB::GenPept();

my $seq = $db->get_Seq_by_acc(shift || die("provide an accession or gi
number on the cmdline");

my $out = new Bio::SeqIO;
print "before\n";
$out->write_seq($seq);
print "after\n";
seqid2NCBI($seq);
$out->write_seq($seq);

sub seqid2NCBI {
    my $seq = shift || die("Must provide a seq object");
    if( $seq->primary_id =~ /^(\d+)$/ ) {
	my $acc = $seq->accession_number || $seq->display_id;
	my $version  = $seq->can('seq_version') ? $seq->seq_version : 1;
	my $src = ( $acc =~ /^NP[PTM]/ ) ? 'ref' : $default_src;

	my $id = sprintf("gi|%d|%s|%s.%d|%s", $1, $src,
			 $acc,$version,
			 $seq->display_id);
	$seq->display_id($id);
    } else {
	print STDERR "did not provide a seq which has GI value set in
primary_id\n";
    }
    return $seq;
}

On Mon, 7 Apr 2003, Mikaela Ilinca Gabrielli wrote:

> Thanks for the clarification!
>
> I'm afraid my programming skills are still in their infancy so I make no
> promise of a new function. But I'll give it a try.
>
> Meanwhile, if anyone feels keen and capable to undertake this task and code
> a seq2NCBIfasta - function, it would be appreciated.
>
> Cheers,
>
> /Mikaela
>
> -----Original Message-----
> From: Heikki Lehvaslaiho [mailto:heikki at ebi.ac.uk]
> Sent: 07-apr-2003 14:26
> To: Mikaela Ilinca Gabrielli
> Cc: Bioperl
> Subject: Re: [Bioperl-l] Retrieve FASTA seqs with NCBI definition line
>
>
> Mikaela,
>
> I do not think there is a shortcut and I do not think modifying the the
> way fasta is processed from the sequence information is a good idea. So
> many things depend on it. You have to manually modify the return values
> of methods: display_id() and  desc().
>
> Assuming you go ahead and do it, could you put the code into a function
> (seq2NCBIfasta  ? ) which could be added into Bio::SeqUtils?
>
> Then anyone needing to do the same thing could:
>
> $out = Bio::SeqIO->new(-format => 'fasta');
> $out->write_seq(Bio::SeqUtils->seq2NCBIfasta($seq));
>
> Cheers,
> 	-Heikki
>
>
> On Mon, 2003-04-07 at 12:06, Mikaela Ilinca Gabrielli wrote:
> > Dear all,
> >
> > I'd like to retrieve sequences from GenPept that are in fasta format AND
> > include the NCBI definition line. I thought this was easy but as I apply
> > Bio::DB::GenPept I get only a part of the NCBI definition line - missing
> gi
> > and accession number information.
> >
> > ex def-line from NCBI:
> > >gi|4504379|ref|NP_003658.1| G protein-coupled receptor 49; orphan G
> > protein-coupled receptor HG38; G protein-coupled receptor 67 [Homo
> sapiens]
> >
> > ex defline retrieved through Bio:
> >
> > >GPR49 G protein-coupled receptor 49; orphan G protein-coupled receptor
> > HG38; G protein-coupled receptor 67 [Homo sapiens]
> >
> > Is there any easy way to get around this or do I have to use
> > '$seq->primary_id' and '$seq->accession_number' to "cut&paste" my own
> fasta
> > records that look like those in NCBI ?
> >
> > Best regards,
> >
> > Mikaela
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu


More information about the Bioperl-l mailing list