[Bioperl-l] BIO::DB::Genbank problem handling Contigs with
db->get_Stream_by_query($query);
! Roievil !
roievil at hotmail.com
Fri Jul 9 16:32:35 EDT 2004
my $db = Bio::DB::GenBank->new;
my $stream = $db->get_Stream_by_query($query);
my query is 'arabidopsis thaliana [org] AND mads AND biomol_genomic [prop]'
where mads is the name of the family i have to study
then i iterate through features to get the exon borders where the product
contains mads
NCBI gives me a list of arabidopsis genomic sequences in genbank format but
when in that list there is a contig which is in the default genbank format
instead of the full genbank format I get the following error :
-------------------- WARNING ---------------------
MSG: CONTIG found. GenBank get_Stream_by_acc about to run.
---------------------------------------------------
------------- EXCEPTION -------------
MSG: WebDBSeqI Error - check query sequences!
STACK Bio::DB::WebDBSeqI::get_seq_stream
C:\Perl\site\lib/Bio/DB/WebDBSeqI.pm:46
4
STACK Bio::DB::NCBIHelper::get_Stream_by_acc
C:\Perl\site\lib/Bio/DB/NCBIHelper.
pm:415
STACK Bio::DB::NCBIHelper::postprocess_data
C:\Perl\site\lib/Bio/DB/NCBIHelper.p
m:309
STACK Bio::DB::WebDBSeqI::get_seq_stream
C:\Perl\site\lib/Bio/DB/WebDBSeqI.pm:46
8
STACK Bio::DB::NCBIHelper::get_Stream_by_query
C:\Perl\site\lib/Bio/DB/NCBIHelpe
r.pm:248
STACK toplevel exonparser.pl:28
--------------------------------------
There is a note in the doc of BIO::DB::Genbank saying :
Note that when querying for GenBank accessions starting with 'NT_' you
will need to call $gb->request_format('fasta') beforehand, because
in GenBank format (the default) the sequence part will be left out
(the reason is that NT contigs are rather annotation with referencesto
clones).
Some work has been done to automatically detect and retrieve whole NT_clones
when the data is in that format (NCBI RefSeq clones).
but it seems not to work for me maybe because the contig's accession number
is not starting with NT_ it is : AE005173
I don't know how to know if a sequence is a contig and how to deal with it
(changing the format from default genbank to full genbank
Thank you, in fact, my code works for rice and wheat
(I also provide my whole code but the commentaries are in french)
I also have another minor problem : for the output of my code i write fasta
sequences i want to provide the accession number of the product, and the
specie and the product description.
I think I wrote the good piece of code :
$newSeq = Bio::PrimarySeq->new(-seq => $subSeq,
-display_id => $accession,
-description =>$genus . "_"
.$species. " " . $product) ;
and if i run the script in windows it only writes the display id (not the
description)
if i am in linux the description is also writen.
I installed bioperl in windows with perl activestate 5.6.1 and through ppm
the bioperl 1.4 and bioperl bundle
what to update and is that update available
thank you very much
Olivier glorieux
The whole code :
#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;
# Si la ligne de commande ne contient pas les arguments attendus
# alors on ecrit la ligne suivante a l'ecran
my $usage = "exonParser.pl genus species motif outfile (the outfile will be
in fasta format)\n".
"eg: perl exonParser.pl triticum aestivum mads taMads.fa\n";
# on rentre les arguments dans leurs variables respectives
my $genus = shift or die $usage;
my $species = shift or die $usage;
my $motif = shift or die $usage;
my $outfile = shift or die $usage;
# on procède à une requète sur NCBI qui va nous renvoyer toutes
# les sequences d'ADN, de l'espèce entrée contenant le motif entré
my $query = Bio::DB::Query::GenBank->new(-query => $genus." ".
$species."[orgn] AND ". $motif. " AND biomol_genomic[prop]",
-db =>"nucleotide");
print ("test \n") ;
my $db = Bio::DB::GenBank->new;
print ("test1 \n") ;
my $stream = $db->get_Stream_by_query($query);
print ("test2 \n") ;
my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",
'-format' => 'fasta');
print ("test3 \n") ;
my $sequence;
# compteur du nombre de séquences
my $nbSequences = 0 ;
# pour chaque sequence du query
while ($sequence = $stream->next_seq){
print ("\n\n Accession Séquence : ".$sequence->display_id."\n");
#my $species = $sequence->species;
my @features = $sequence->get_SeqFeatures;
my $reverseStrand = -1;
my $accession;
my $product;
# pour chaque feature du format Genbank
for my $f (@features) {
# pour chaque primary_tag (e.g : Gene, CDS, mRNA), si ce tag est
égal a mRNA
my $tag = $f->primary_tag;
if ($tag eq ("mRNA")) {
# pour chaque tag (e.g. : organism, gene, product)
# on récupère le contenut du champs produit (si il existe)
my @tags = $f->get_all_tags;
if ($f->has_tag("product")) {
($product) = $f->get_tag_values("product");
}
}
if ($tag eq( "CDS" ) ) {
#print " $tag\n";
#my $accession;
#my $product;
# pour chaque tag (e.g. : organism, gene, product)
# on récupère l'ID de la protéine (si il existe)
my @tags = $f->get_all_tags;
if ($f->has_tag("protein_id")) {
($accession) = $f->get_tag_values("protein_id");
}
# on récupère le contenu du champs produit (si il existe)
if ($f->has_tag("product")) {
($product) = $f->get_tag_values("product");
#print ($product."\n") ;
}
# si le produit contient le motif en minuscules ou majuscules
if ($product =~ /$motif/i) {
print (" Produit : ".$product."\n") ;
print (" numéro d'accession produit :
".$accession."\n") ;
my $seqStart = $f->start;
my $seqEnd = $f->end;
my $strand = $f->strand;
my $newSeq;
# on récupère la séquence et on la met en minuscules
my $subSeq = lc($sequence->subseq($seqStart, $seqEnd));
# pour chaque Location on récupère les sous-locations
my $complex_location = $f->location;
my @sublocations = $complex_location->each_Location;
# pour chaque sous-location on récupère la sous-séquence de
l'exon
# puis on la met en majuscules
for my $sl (@sublocations) {
my $slStart = $sl->start - $seqStart;
my $slEnd = $sl->end - $seqStart;
my $ucportion = uc( substr ( $subSeq, $slStart, $slEnd -
$slStart + 1 ) );
substr( $subSeq, $slStart, $slEnd - $slStart + 1,
$ucportion );
}
# print ("test" .$subSeq. "\n") ;
# On crée une nouvelle séquence format Fasta
$newSeq = Bio::PrimarySeq->new(-seq => $subSeq,
-display_id => $accession,
-description =>$genus . "_"
.$species. " " . $product) ;
if ($strand == $reverseStrand) {
$newSeq = $newSeq->revcom;
}
# on incrémente le nombre de séquences
# print ("test \n") ;
$nbSequences++ ;
# ecrit chaque entree dans le fichier de sortie
$seq_out->write_seq($newSeq);
}
}
}
}
print("nombre de séquences : ".$nbSequences) ;
exit;
_________________________________________________________________
Tired of spam? Get advanced junk mail protection with MSN 8.
http://join.msn.com/?page=features/junkmail
More information about the Bioperl-l
mailing list