[Bioperl-l] [Structure of remote GenBank files]
Sebastien Moretti
Sebastien.Moretti at igs.cnrs-mrs.fr
Thu Apr 22 03:15:57 EDT 2004
Hello
I use the following BioPerl script to fetch GenBank and RefSeq records in
GenBank flat file format:
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;
use Bio::SeqIO;
my $acc = $ARGV[0] or die "\n\tThe accession number you are looking for is missing.\n\tTry something like: ./update_estCDK.pl NM_178432\n\n";
# Restrict the Entrez query to the accession field
my $query_string = $acc."[Accession]";
my $query = Bio::DB::Query::GenBank->new(-db=>'nucleotide',
-query=>$query_string);
my $gb = Bio::DB::GenBank->new;
my $stream = $gb->get_Stream_by_query($query);
my $out=Bio::SeqIO->new(-format=>'genbank');
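my $seq = $stream->next_seq() or die "No record found for $query_string\n";
# $out has no -file or -fh, so write_seq() prints the record to STDOUT itself.
# It only returns a true value, so there is nothing left to print afterwards.
$out->write_seq($seq);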
exit;
It works fine, but the files I get have two structural problems:
- the PUBMED field is appended to the JOURNAL field above it instead of
starting on its own line:
JOURNAL J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED 12912980
or
JOURNAL J. Cancer Res. Clin. Oncol. 129 (9), 498-502 (2003) PUBMED
12884029
or
JOURNAL Am. J. Physiol. Heart Circ. Physiol. 284 (6), H1917-H1923 (2003)
PUBMED 12742823
- the COMMENT field has lost its blank lines and line breaks (\n), so it looks
like one compact block:
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from Y00272.1 and BC014563.1. On
Oct 22, 2001 this sequence version replaced gi:4502708. Summary:
The protein encoded by this gene is a member of the Ser/Thr
protein kinase family. This protein is a catalytic subunit of the
highly conserved protein kinase complex known as M-phase promoting
factor (MPF), which is essential for G1/S and G2/M phase
transitions of eukaryotic cell cycle. Mitotic cyclins stably
associate with this protein and function as regulatory subunits.
The kinase activity of this protein is controlled by cyclin
accumulation and destruction through the cell cycle. The
phosphorylation and dephosphorylation of this protein also play
important regulatory roles in cell cycle control. Transcript
Variant: This variant (1) encodes the full length isoform.
COMPLETENESS: complete on the 3' end.
Does this come from my script?
Do you see the same thing?
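In the meantime, one possible workaround for the first problem is to post-process the text. This is only a minimal sketch: split_pubmed and fix_journal are hypothetical helper names, not BioPerl functions, and it assumes the usual GenBank flat file layout (JOURNAL indented by two spaces, its continuation lines by twelve, PUBMED by three). It does nothing for the COMMENT problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): put a PUBMED id that ended up
# inside a JOURNAL field back on its own line.
sub fix_journal {
    my ($journal) = @_;
    # The id may follow on the same line or on a wrapped continuation line.
    $journal =~ s/\s+PUBMED\s+(\d+)\s*\z/\n   PUBMED   $1/;
    return $journal;
}

# Hypothetical helper: apply fix_journal() to every JOURNAL field in a record.
# Assumes the standard layout: "  JOURNAL   " followed by continuation lines
# indented by 12 spaces.
sub split_pubmed {
    my ($text) = @_;
    $text =~ s{^(  JOURNAL   .*(?:\n {12}.*)*)}{ fix_journal($1) }gme;
    return $text;
}

# Small demonstration on one of the lines quoted above.
print split_pubmed(
    "  JOURNAL   J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED 12912980\n"
);
```

To apply it to the script above, the record would first have to be captured as a string (for example by piping the script's output through a filter that calls split_pubmed) rather than written straight to STDOUT.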
Thanks
--
Sebastien MORETTI
CNRS - IGS
31 chemin Joseph Aiguier
13402 Marseille cedex 20, FRANCE
tel. +33 (0)4 91 16 44 55