[Bioperl-l] BIO::DB::Genbank problem handling Contigs with db->get_Stream_by_query($query);

! Roievil ! roievil at hotmail.com
Fri Jul 9 16:32:35 EDT 2004


my $db = Bio::DB::GenBank->new;

my $stream = $db->get_Stream_by_query($query);

my query is 'arabidopsis thaliana [org] AND mads AND biomol_genomic [prop]'

where mads is the name of the family i have to study
then i iterate through features to get the exon borders where the product 
contains mads

NCBI gives me a list of arabidopsis genomic sequences in genbank format but 
when in that list there is a contig which is in the default genbank format 
instead of the full genbank format I get the following error :

-------------------- WARNING ---------------------
MSG: CONTIG found. GenBank get_Stream_by_acc about to run.
---------------------------------------------------

------------- EXCEPTION  -------------
MSG: WebDBSeqI Error - check query sequences!

STACK Bio::DB::WebDBSeqI::get_seq_stream 
C:\Perl\site\lib/Bio/DB/WebDBSeqI.pm:46
4
STACK Bio::DB::NCBIHelper::get_Stream_by_acc 
C:\Perl\site\lib/Bio/DB/NCBIHelper.
pm:415
STACK Bio::DB::NCBIHelper::postprocess_data 
C:\Perl\site\lib/Bio/DB/NCBIHelper.p
m:309
STACK Bio::DB::WebDBSeqI::get_seq_stream 
C:\Perl\site\lib/Bio/DB/WebDBSeqI.pm:46
8
STACK Bio::DB::NCBIHelper::get_Stream_by_query 
C:\Perl\site\lib/Bio/DB/NCBIHelpe
r.pm:248
STACK toplevel exonparser.pl:28

--------------------------------------

There is a note in the doc of BIO::DB::Genbank saying :

Note that when querying for GenBank accessions starting with 'NT_' you
will need to call $gb->request_format('fasta') beforehand, because
in GenBank format (the default) the sequence part will be left out
(the reason is that NT contigs are rather annotation with referencesto 
clones).
Some work has been done to automatically detect and retrieve whole NT_clones
when the data is in that format (NCBI RefSeq clones).

but it seems not to work for me maybe because the contig's accession number 
is not starting with NT_ it is  : AE005173

I don't know how to know if a sequence is a contig and how to deal with it 
(changing the format from default genbank to full genbank

Thank you, in fact, my code works for rice and wheat

(I also provide my whole code but the commentaries are in french)




I also have another minor problem : for the output of my code i write fasta 
sequences i want to provide the accession number of the product, and the 
specie and the product description.

I think I wrote the good piece of code :

                $newSeq = Bio::PrimarySeq->new(-seq => $subSeq,
                                               -display_id => $accession,
                                               -description =>$genus . "_" 
.$species. " " . $product) ;

and if i run the script in windows it only writes the display id (not the 
description)
if i am in linux the description is also writen.

I installed bioperl in windows with perl activestate 5.6.1 and through ppm 
the bioperl 1.4 and bioperl bundle

what to update and is that update available

thank you very much

Olivier glorieux





The whole code :

#!/usr/bin/perl


use strict;
use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;

# Si la ligne de commande ne contient pas les arguments attendus
# alors on ecrit la ligne suivante a l'ecran
my $usage = "exonParser.pl genus species motif outfile (the outfile will be 
in fasta format)\n".
            "eg: perl exonParser.pl triticum aestivum mads taMads.fa\n";

# on rentre les arguments dans leurs variables respectives
my $genus = shift or die $usage;
my $species = shift or die $usage;
my $motif = shift or die $usage;
my $outfile = shift or die $usage;

# on procède à une requète sur NCBI qui va nous renvoyer toutes
# les sequences d'ADN, de l'espèce entrée contenant le motif entré
my $query =  Bio::DB::Query::GenBank->new(-query => $genus." ". 
$species."[orgn] AND ". $motif. " AND biomol_genomic[prop]",
                                          -db =>"nucleotide");
print ("test \n") ;
my $db = Bio::DB::GenBank->new;
print ("test1 \n") ;
my $stream = $db->get_Stream_by_query($query);
print ("test2 \n") ;
my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",
                              '-format' => 'fasta');
print ("test3 \n") ;
my $sequence;

# compteur du nombre de séquences
my $nbSequences = 0 ;

# pour chaque sequence du query
while ($sequence = $stream->next_seq){
    print ("\n\n Accession Séquence : ".$sequence->display_id."\n");
    #my $species =  $sequence->species;
    my @features =  $sequence->get_SeqFeatures;
    my $reverseStrand = -1;
    my $accession;
    my $product;
    # pour chaque feature du format Genbank
    for my $f (@features) {
        # pour chaque primary_tag (e.g : Gene, CDS, mRNA), si ce tag est 
égal a mRNA
        my $tag = $f->primary_tag;
        if ($tag eq ("mRNA")) {
            # pour chaque tag (e.g. : organism, gene, product)
            # on récupère le contenut du champs produit (si il existe)
            my @tags = $f->get_all_tags;
            if ($f->has_tag("product")) {
                ($product) = $f->get_tag_values("product");
            }
        }
        if ($tag eq( "CDS" ) ) {
            #print " $tag\n";
            #my $accession;
            #my $product;
            # pour chaque tag (e.g. : organism, gene, product)
            # on récupère l'ID de la protéine (si il existe)
            my @tags = $f->get_all_tags;
            if ($f->has_tag("protein_id")) {
                ($accession) = $f->get_tag_values("protein_id");
            }
            # on récupère le contenu du champs produit  (si il existe)
            if ($f->has_tag("product")) {
                ($product) = $f->get_tag_values("product");
                #print ($product."\n") ;
            }
            # si le produit contient le motif en minuscules ou majuscules
            if ($product =~ /$motif/i) {
                print ("        Produit : ".$product."\n") ;
                print ("        numéro d'accession produit : 
".$accession."\n") ;
                my $seqStart = $f->start;
                my $seqEnd = $f->end;
                my $strand = $f->strand;
                my $newSeq;
                # on récupère la séquence et on la met en minuscules
                my $subSeq = lc($sequence->subseq($seqStart, $seqEnd));
                # pour chaque Location on récupère les sous-locations
                my $complex_location = $f->location;
                my @sublocations = $complex_location->each_Location;
                # pour chaque sous-location on récupère la sous-séquence de 
l'exon
                # puis on la met en majuscules
                for my $sl (@sublocations) {
                    my $slStart = $sl->start - $seqStart;
                    my $slEnd = $sl->end - $seqStart;

                    my $ucportion = uc( substr ( $subSeq, $slStart, $slEnd - 
$slStart + 1 ) );
                    substr( $subSeq, $slStart, $slEnd - $slStart + 1, 
$ucportion );

                }
                # print ("test" .$subSeq. "\n") ;
                # On crée une nouvelle séquence format Fasta
                $newSeq = Bio::PrimarySeq->new(-seq => $subSeq,
                                               -display_id => $accession,
                                               -description =>$genus . "_" 
.$species. " " . $product) ;
                if ($strand == $reverseStrand) {
                    $newSeq = $newSeq->revcom;
                }
                # on incrémente le nombre de séquences
                # print ("test \n") ;
                $nbSequences++ ;
                # ecrit chaque entree dans le fichier de sortie
                $seq_out->write_seq($newSeq);
            }
        }
    }
}
print("nombre de séquences  : ".$nbSequences) ;
exit;

_________________________________________________________________
Tired of spam? Get advanced junk mail protection with MSN 8. 
http://join.msn.com/?page=features/junkmail



More information about the Bioperl-l mailing list