[Bioperl-l] RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed
Scott Markel
Scott.Markel at accelrys.com
Tue Jul 30 21:18:52 UTC 2013
According to today's "[Refseq-announce] Post-release 60: human supplemental files & bacterial record format" both CONTIG and ORIGIN are now allowed in a GenBank-formatted entry. See below (*) or the second bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html for details.
This change breaks Bio::SeqIO::genbank: when a CONTIG line is present, the sequence data following ORIGIN is not read, so $seq->seq() does not return a sequence string. See lines 713-741 of Bio::SeqIO::genbank.
Note that this is related to the "Protein Records without Sequence" thread (http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).
Scott
(*) Details on the change
[3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.
Under the new data model for bacterial proteins, a subset of records continue to provide an organism-oriented package of protein records. These records use traditional RefSeq accession prefixes (NP, YP) and include a pointer to the identical non-redundant WP protein record. Those NP and YP records that have been updated to refer to a non-redundant WP protein record, such as YP_008335932.1, include the following flat file display details:
. Genome Annotation Data structured comment is also displayed on protein records for the subset of bacterial genomes that have gone through the updated NCBI prokaryotic annotation pipeline.
. Records include both a CONTIG line, which refers to the non-redundant WP protein accession, and also an ORIGIN with the sequence residues following. The sequence shown is from the WP protein record.
CONTIG      join(WP_015644991.1:1..273)
ORIGIN
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
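To make the expected behavior concrete, here is a minimal standalone sketch in Python of a reader that keeps both the CONTIG value and the residues after ORIGIN, instead of skipping ORIGIN once CONTIG has been seen. It is not the Bio::SeqIO::genbank code and is not a proposed patch; the function name parse_contig_and_origin and the RECORD constant are hypothetical names made up for this illustration, and it only handles the two blocks shown in the excerpt above.

```python
import re

# The CONTIG/ORIGIN excerpt from YP_008335932.1 quoted in the announcement.
RECORD = """\
CONTIG      join(WP_015644991.1:1..273)
ORIGIN
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
"""


def parse_contig_and_origin(text):
    """Return (contig, sequence) from a GenBank-style record fragment.

    Hypothetical helper for illustration only. Either value is None when
    the corresponding block is absent. A CONTIG line does not stop the
    ORIGIN block from being read.
    """
    contig_parts = []
    residues = []
    section = None  # which block we are currently inside

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "//":  # end-of-record marker
            break
        if stripped.startswith("CONTIG"):
            section = "contig"
            contig_parts.append(stripped[len("CONTIG"):].strip())
        elif stripped.startswith("ORIGIN"):
            section = "origin"
        elif section == "contig":
            # Continuation line of a long CONTIG join(...) expression.
            contig_parts.append(stripped)
        elif section == "origin":
            # Drop the position counter and the spaces between 10-mers.
            residues.append(re.sub(r"[\d\s]", "", stripped))

    contig = "".join(contig_parts) or None
    sequence = "".join(residues) or None
    return contig, sequence


if __name__ == "__main__":
    contig, sequence = parse_contig_and_origin(RECORD)
    print(contig)
    print(len(sequence))
```

With a record like this one, the length of the returned sequence can be checked against the 1..273 range given on the CONTIG line, which is a cheap sanity test that the ORIGIN block was actually read.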
Scott Markel, Ph.D.
Principal Bioinformatics Architect email: smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D) mobile: +1 858 205 3653
5005 Wateridge Vista Drive voice: +1 858 799 5603
San Diego, CA 92121 fax: +1 858 799 5222
USA web: http://www.accelrys.com
http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLOS Computational Biology
Editorial Board: Briefings in Bioinformatics