[Bioperl-l] RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed

Scott Markel Scott.Markel at accelrys.com
Tue Jul 30 21:18:52 UTC 2013


According to today's RefSeq announcement, "[Refseq-announce] Post-release 60: human supplemental files & bacterial record format", a GenBank-formatted entry may now contain both a CONTIG line and an ORIGIN section.  See below (*) or the second bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html for details.

This change breaks Bio::SeqIO::genbank: when a CONTIG line is present, the parser does not read the sequence data following ORIGIN, so $seq->seq() does not return a sequence string.  See lines 713-741 of Bio::SeqIO::genbank.

Note that this is related to the "Protein Records without Sequence" thread (http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).

Scott

(*) Details on the change

[3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.

Under the new data model for bacterial proteins, a subset of records continue to provide an organism-oriented package of protein records. These records use traditional RefSeq accession prefixes (NP, YP) and include a pointer to the identical non-redundant WP protein record.  Those NP and YP records that have been updated to refer to a non-redundant WP protein record, such as YP_008335932.1, include the following flat file display details:

- Genome Annotation Data structured comment is also displayed on protein records for the subset of bacterial genomes that have gone through the updated NCBI prokaryotic annotation pipeline.
- Records include both a CONTIG line, which refers to the non-redundant WP protein accession, and also an ORIGIN with the sequence residues following. The sequence shown is from the WP protein record.

CONTIG      join(WP_015644991.1:1..273)
ORIGIN      
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
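
For anyone who needs the residues before the parser is updated, here is a minimal sketch of the reading logic such a record calls for: keep the CONTIG location, but still collect the residues that follow ORIGIN. It is plain Python with no BioPerl dependency and is not the Bio::SeqIO::genbank implementation; the function name read_origin_sequence and the SAMPLE_RECORD constant are hypothetical, and the sample is only the excerpt shown above, not a full flat file.

```python
import re

# Excerpt of the record shown above (assumption: only these lines matter here).
SAMPLE_RECORD = "\n".join([
    "CONTIG      join(WP_015644991.1:1..273)",
    "ORIGIN      ",
    "        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn",
    "       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild",
    "      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna",
    "      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl",
    "      241 elslkndeif ykgavryigm svlgmgvfdr yfl",
    "//",
])


def read_origin_sequence(flatfile_text):
    """Return (contig_location, sequence) for one GenBank-style record.

    Unlike a parser that stops at CONTIG, this keeps reading and
    collects the ORIGIN residues when both sections are present.
    """
    contig_parts = []
    residues = []
    section = None
    for line in flatfile_text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("CONTIG"):
            section = "CONTIG"
            contig_parts.append(line[len("CONTIG"):].strip())
        elif line.startswith("ORIGIN"):
            section = "ORIGIN"
        elif line[:1].isspace() and line.strip():
            # Indented line: continuation of the current section.
            if section == "CONTIG":
                contig_parts.append(line.strip())
            elif section == "ORIGIN":
                # Drop the position counter and the spaces between blocks.
                residues.append(re.sub(r"[\s\d]", "", line))
        elif line.strip():
            section = None                 # some other top-level keyword
    contig = "".join(contig_parts) or None
    return contig, "".join(residues)


contig, sequence = read_origin_sequence(SAMPLE_RECORD)
print(contig)         # join(WP_015644991.1:1..273)
print(len(sequence))  # 273
```

The residue count matches the 1..273 range given on the CONTIG line, which is a cheap consistency check for records of this kind.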

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
5005 Wateridge Vista Drive          voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLOS Computational Biology
Editorial Board: Briefings in Bioinformatics
