[Bioperl-l] [Structure of remote GenBank files]

Sebastien Moretti Sebastien.Moretti at igs.cnrs-mrs.fr
Thu Apr 22 03:15:57 EDT 2004


Hello,
I use the BioPerl script below to fetch GenBank and RefSeq records in GenBank
flat file format.

	#!/usr/bin/perl
	
	use strict;
	use warnings;
	use Bio::DB::GenBank;
	use Bio::DB::Query::GenBank;
	use Bio::SeqIO;
	
	# Accession number to fetch, e.g. NM_178432
	my $acc = $ARGV[0]
	    or die "\n\tThe accession number you are looking for is missing.\n"
	         . "\tTry something like: ./update_estCDK.pl NM_178432\n\n";
	
	# Restrict the Entrez query to the accession field
	my $query_string = $acc . "[Accession]";
	my $query = Bio::DB::Query::GenBank->new(-db    => 'nucleotide',
	                                         -query => $query_string);
	
	my $gb     = Bio::DB::GenBank->new;
	my $stream = $gb->get_Stream_by_query($query);
	
	# With no -file or -fh argument, Bio::SeqIO writes to STDOUT
	my $out = Bio::SeqIO->new(-format => 'genbank');
	my $seq = $stream->next_seq()
	    or die "No sequence returned for $query_string\n";
	
	# write_seq() prints the record itself and returns a true value,
	# so there is nothing further to print
	$out->write_seq($seq);
	
	exit;

It works fine, but I have two structural problems in the output files:
	- the PUBMED field is appended to the JOURNAL field on the line above
	   instead of starting on its own line:
  JOURNAL   J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED   12912980
or
  JOURNAL   J. Cancer Res. Clin. Oncol. 129 (9), 498-502 (2003) PUBMED
            12884029
or
  JOURNAL   Am. J. Physiol. Heart Circ. Physiol. 284 (6), H1917-H1923 (2003)
            PUBMED   12742823

	- the COMMENT field has no blank lines or line breaks (\n), so its text
	   runs together in one compact block:
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from Y00272.1 and BC014563.1. On
            Oct 22, 2001 this sequence version replaced gi:4502708. Summary:
            The protein encoded by this gene is a member of the Ser/Thr
            protein kinase family. This protein is a catalytic subunit of the
            highly conserved protein kinase complex known as M-phase promoting
            factor (MPF), which is essential for G1/S and G2/M phase
            transitions of eukaryotic cell cycle. Mitotic cyclins stably
            associate with this protein and function as regulatory subunits.
            The kinase activity of this protein is controlled by cyclin
            accumulation and destruction through the cell cycle. The
            phosphorylation and dephosphorylation of this protein also play
            important regulatory roles in cell cycle control. Transcript
            Variant: This variant (1) encodes the full length isoform.
            COMPLETENESS: complete on the 3' end.
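
As a stopgap for the first problem only, the text could probably be
post-processed once it has been captured. The following is a minimal sketch, not
a proper fix: fix_pubmed_lines is a hypothetical helper of my own (it is not part
of BioPerl), and it assumes that "PUBMED" followed by digits at the end of a line
only ever occurs inside a REFERENCE block, and that the usual layout is three
spaces, PUBMED, three spaces. It does nothing for the COMMENT field, whose
original line breaks cannot be recovered this way.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): move a PUBMED entry that was
# merged into the JOURNAL field back onto its own line.
# Assumption: "PUBMED <digits>" at the end of a line only occurs in
# REFERENCE blocks.
sub fix_pubmed_lines {
    my ($text) = @_;
    # \s+ also matches a line break, so one pattern covers PUBMED glued to
    # the end of the JOURNAL line, PUBMED separated from its ID, and PUBMED
    # indented as a continuation line.
    $text =~ s/\s+PUBMED\s+(\d+)[ \t]*$/\n   PUBMED   $1/mg;
    return $text;
}

# The three layouts seen in the output
my $record = <<'END';
  JOURNAL   J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED   12912980
  JOURNAL   J. Cancer Res. Clin. Oncol. 129 (9), 498-502 (2003) PUBMED
            12884029
  JOURNAL   Am. J. Physiol. Heart Circ. Physiol. 284 (6), H1917-H1923 (2003)
            PUBMED   12742823
END

print fix_pubmed_lines($record);
```

A line that already has PUBMED in the right place is left unchanged, so running
the helper on a correct record should be harmless.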

Does this come from my script?
Do you see the same thing?
Thanks

-- 
Sebastien MORETTI
CNRS - IGS
31 chemin Joseph Aiguier
13402 Marseille cedex 20, FRANCE
tel. +33 (0)4 91 16 44 55

