[Bioperl-l] BIO::DB::Genbank problem handling Contigs with db->get_Stream_by_query($query);

Lincoln Stein lstein at cshl.edu
Fri Jul 9 17:02:47 EDT 2004


Can you try bioperl-live in CVS?  The query module has changed recently.

Lincoln

On Friday 09 July 2004 04:32 pm, ! Roievil ! wrote:
> my $db = Bio::DB::GenBank->new;
>
> my $stream = $db->get_Stream_by_query($query);
>
> my query is 'arabidopsis thaliana [org] AND mads AND biomol_genomic [prop]'
>
> where mads is the name of the family i have to study
> then i iterate through features to get the exon borders where the product
> contains mads
>
> NCBI gives me a list of arabidopsis genomic sequences in genbank format but
> when in that list there is a contig which is in the default genbank format
> instead of the full genbank format I get the following error :
>
> -------------------- WARNING ---------------------
> MSG: CONTIG found. GenBank get_Stream_by_acc about to run.
> ---------------------------------------------------
>
> ------------- EXCEPTION  -------------
> MSG: WebDBSeqI Error - check query sequences!
>
> STACK Bio::DB::WebDBSeqI::get_seq_stream
> C:\Perl\site\lib/Bio/DB/WebDBSeqI.pm:46
> 4
> STACK Bio::DB::NCBIHelper::get_Stream_by_acc
> C:\Perl\site\lib/Bio/DB/NCBIHelper.
> pm:415
> STACK Bio::DB::NCBIHelper::postprocess_data
> C:\Perl\site\lib/Bio/DB/NCBIHelper.p
> m:309
> STACK Bio::DB::WebDBSeqI::get_seq_stream
> C:\Perl\site\lib/Bio/DB/WebDBSeqI.pm:46
> 8
> STACK Bio::DB::NCBIHelper::get_Stream_by_query
> C:\Perl\site\lib/Bio/DB/NCBIHelpe
> r.pm:248
> STACK toplevel exonparser.pl:28
>
> --------------------------------------
>
> There is a note in the doc of BIO::DB::Genbank saying :
>
> Note that when querying for GenBank accessions starting with 'NT_' you
> will need to call $gb->request_format('fasta') beforehand, because
> in GenBank format (the default) the sequence part will be left out
> (the reason is that NT contigs are rather annotation with referencesto
> clones).
> Some work has been done to automatically detect and retrieve whole
> NT_clones when the data is in that format (NCBI RefSeq clones).
>
> but it seems not to work for me maybe because the contig's accession number
> is not starting with NT_ it is  : AE005173
>
> I don't know how to know if a sequence is a contig and how to deal with it
> (changing the format from default genbank to full genbank
>
> Thank you, in fact, my code works for rice and wheat
>
> (I also provide my whole code but the commentaries are in french)
>
>
>
>
> I also have another minor problem : for the output of my code i write fasta
> sequences i want to provide the accession number of the product, and the
> specie and the product description.
>
> I think I wrote the good piece of code :
>
>                 $newSeq = Bio::PrimarySeq->new(-seq => $subSeq,
>                                                -display_id => $accession,
>                                                -description =>$genus . "_"
> .$species. " " . $product) ;
>
> and if i run the script in windows it only writes the display id (not the
> description)
> if i am in linux the description is also writen.
>
> I installed bioperl in windows with perl activestate 5.6.1 and through ppm
> the bioperl 1.4 and bioperl bundle
>
> what to update and is that update available
>
> thank you very much
>
> Olivier glorieux
>
>
>
>
>
> The whole code :
>
> #!/usr/bin/perl
>
>
> use strict;
> use Bio::SeqIO;
> use Bio::Seq;
> use Bio::DB::GenBank;
> use Bio::DB::Query::GenBank;
>
> # Si la ligne de commande ne contient pas les arguments attendus
> # alors on ecrit la ligne suivante a l'ecran
> my $usage = "exonParser.pl genus species motif outfile (the outfile will be
> in fasta format)\n".
>             "eg: perl exonParser.pl triticum aestivum mads taMads.fa\n";
>
> # on rentre les arguments dans leurs variables respectives
> my $genus = shift or die $usage;
> my $species = shift or die $usage;
> my $motif = shift or die $usage;
> my $outfile = shift or die $usage;
>
> # on procède à une requète sur NCBI qui va nous renvoyer toutes
> # les sequences d'ADN, de l'espèce entrée contenant le motif entré
> my $query =  Bio::DB::Query::GenBank->new(-query => $genus." ".
> $species."[orgn] AND ". $motif. " AND biomol_genomic[prop]",
>                                           -db =>"nucleotide");
> print ("test \n") ;
> my $db = Bio::DB::GenBank->new;
> print ("test1 \n") ;
> my $stream = $db->get_Stream_by_query($query);
> print ("test2 \n") ;
> my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",
>                               '-format' => 'fasta');
> print ("test3 \n") ;
> my $sequence;
>
> # compteur du nombre de séquences
> my $nbSequences = 0 ;
>
> # pour chaque sequence du query
> while ($sequence = $stream->next_seq){
>     print ("\n\n Accession Séquence : ".$sequence->display_id."\n");
>     #my $species =  $sequence->species;
>     my @features =  $sequence->get_SeqFeatures;
>     my $reverseStrand = -1;
>     my $accession;
>     my $product;
>     # pour chaque feature du format Genbank
>     for my $f (@features) {
>         # pour chaque primary_tag (e.g : Gene, CDS, mRNA), si ce tag est
> égal a mRNA
>         my $tag = $f->primary_tag;
>         if ($tag eq ("mRNA")) {
>             # pour chaque tag (e.g. : organism, gene, product)
>             # on récupère le contenut du champs produit (si il existe)
>             my @tags = $f->get_all_tags;
>             if ($f->has_tag("product")) {
>                 ($product) = $f->get_tag_values("product");
>             }
>         }
>         if ($tag eq( "CDS" ) ) {
>             #print " $tag\n";
>             #my $accession;
>             #my $product;
>             # pour chaque tag (e.g. : organism, gene, product)
>             # on récupère l'ID de la protéine (si il existe)
>             my @tags = $f->get_all_tags;
>             if ($f->has_tag("protein_id")) {
>                 ($accession) = $f->get_tag_values("protein_id");
>             }
>             # on récupère le contenu du champs produit  (si il existe)
>             if ($f->has_tag("product")) {
>                 ($product) = $f->get_tag_values("product");
>                 #print ($product."\n") ;
>             }
>             # si le produit contient le motif en minuscules ou majuscules
>             if ($product =~ /$motif/i) {
>                 print ("        Produit : ".$product."\n") ;
>                 print ("        numéro d'accession produit :
> ".$accession."\n") ;
>                 my $seqStart = $f->start;
>                 my $seqEnd = $f->end;
>                 my $strand = $f->strand;
>                 my $newSeq;
>                 # on récupère la séquence et on la met en minuscules
>                 my $subSeq = lc($sequence->subseq($seqStart, $seqEnd));
>                 # pour chaque Location on récupère les sous-locations
>                 my $complex_location = $f->location;
>                 my @sublocations = $complex_location->each_Location;
>                 # pour chaque sous-location on récupère la sous-séquence de
> l'exon
>                 # puis on la met en majuscules
>                 for my $sl (@sublocations) {
>                     my $slStart = $sl->start - $seqStart;
>                     my $slEnd = $sl->end - $seqStart;
>
>                     my $ucportion = uc( substr ( $subSeq, $slStart, $slEnd
> - $slStart + 1 ) );
>                     substr( $subSeq, $slStart, $slEnd - $slStart + 1,
> $ucportion );
>
>                 }
>                 # print ("test" .$subSeq. "\n") ;
>                 # On crée une nouvelle séquence format Fasta
>                 $newSeq = Bio::PrimarySeq->new(-seq => $subSeq,
>                                                -display_id => $accession,
>                                                -description =>$genus . "_"
> .$species. " " . $product) ;
>                 if ($strand == $reverseStrand) {
>                     $newSeq = $newSeq->revcom;
>                 }
>                 # on incrémente le nombre de séquences
>                 # print ("test \n") ;
>                 $nbSequences++ ;
>                 # ecrit chaque entree dans le fichier de sortie
>                 $seq_out->write_seq($newSeq);
>             }
>         }
>     }
> }
> print("nombre de séquences  : ".$nbSequences) ;
> exit;
>
> _________________________________________________________________
> Tired of spam? Get advanced junk mail protection with MSN 8.
> http://join.msn.com/?page=features/junkmail
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l

-- 
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)



More information about the Bioperl-l mailing list