[Open-bio-l] [Bioperl-l] RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed

Tue Jul 30 22:21:47 UTC 2013

Thanks Scott,

I fell foul of this change recently while testing Biopython
against GenPept format records from NCBI Entrez. I had
assumed it might be a short-term glitch:

https://github.com/biopython/biopython/commit/b6ccd0f05804c944f23e4e38df877e30761f491e

https://github.com/biopython/biopython/commit/f3a9e33e428a5ecfb490d4f6f0ede7695fcde0d2

CC'ing the cross-project list in case BioRuby or BioJava
are also affected.

Regards,

Peter

On Tue, Jul 30, 2013 at 10:18 PM, Scott Markel
<Scott.Markel at accelrys.com> wrote:
> According to today's "[Refseq-announce] Post-release 60: human supplemental files & bacterial record format" both CONTIG and ORIGIN are now allowed in a GenBank-formatted entry.  See below (*) or the second bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html for details.
>
> This change breaks Bio::SeqIO::genbank: when a CONTIG line is present, the sequence data following ORIGIN is not read, so $seq->seq() does not return a sequence string.  See lines 713-741 of Bio::SeqIO::genbank.
>
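
To make the failure mode concrete, here is a minimal Python sketch of a reader that keeps the ORIGIN residues even when a CONTIG line is also present. It is not the BioPerl or Biopython code; the function name read_flatfile_sequence and the toy record (including the accession WP_000000000.1) are made up for illustration, and it assumes continuation lines begin with whitespace, as they do in the records shown in this thread.

```python
import re


def read_flatfile_sequence(text):
    """Return (contig, sequence) from GenBank/GenPept flat-file text.

    Hypothetical helper for illustration only. CONTIG and ORIGIN are
    treated as independent sections, so seeing one never causes the
    other to be skipped.
    """
    contig_parts = []
    seq_parts = []
    section = None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line.startswith("CONTIG"):
            section = "contig"
            contig_parts.append(line[len("CONTIG"):].strip())
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line[:1].isspace() and section == "contig":
            # Continuation of a long CONTIG join(...)
            contig_parts.append(line.strip())
        elif line[:1].isspace() and section == "origin":
            # Drop the position numbers and the spaces between blocks
            seq_parts.append(re.sub(r"[\s\d]", "", line))
        elif line.strip():
            section = None  # some other top-level keyword
    contig = "".join(contig_parts) or None
    return contig, "".join(seq_parts)


# Toy record, invented for this example
record = """\
LOCUS       EXAMPLE_1                 12 aa
CONTIG      join(WP_000000000.1:1..12)
ORIGIN
        1 mkvlaagivg ll
//
"""

contig, seq = read_flatfile_sequence(record)
print(contig)
print(seq)
```

A parser written this way returns both pieces of information and leaves it to the caller to decide what to do when a record carries a CONTIG line and residues at the same time.
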
> Note that this is related to the "Protein Records without Sequence" thread (http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).
>
> Scott
>
> (*) Details on the change
>
> [3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.
>
> Under the new data model for bacterial proteins, a subset of records continue to provide an organism-oriented package of protein records. These records use traditional RefSeq accession prefixes (NP, YP) and include a pointer to the identical non-redundant WP protein record.  Those NP and YP records that have been updated to refer to a non-redundant WP protein record, such as YP_008335932.1, include the following flat file display details:
>
> . Genome Annotation Data structured comment is also displayed on protein records for the subset of bacterial genomes that have gone through the updated NCBI prokaryotic annotation pipeline.
> . Records include both a CONTIG line, which refers to the non-redundant WP protein accession, and also an ORIGIN with the sequence residues following. The sequence shown is from the WP protein record.
>
> CONTIG      join(WP_015644991.1:1..273)
> ORIGIN
>         1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
>        61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
>       121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
>       181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
>       241 elslkndeif ykgavryigm svlgmgvfdr yfl
>
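
As a sanity check on records like the one quoted above, the span named in the CONTIG line can be compared with the number of residues under ORIGIN. The following is a rough Python sketch, not code from any Bio* project; the helper names contig_span_length and origin_sequence are hypothetical, and it assumes the CONTIG line contains only accession:start..end segments (no gap() entries).

```python
import re

# The CONTIG/ORIGIN excerpt from the RefSeq announcement quoted above
FLATFILE_TAIL = """\
CONTIG      join(WP_015644991.1:1..273)
ORIGIN
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
"""


def contig_span_length(contig_line):
    """Sum the lengths of all accession:start..end segments (hypothetical helper)."""
    spans = re.findall(r":(\d+)\.\.(\d+)", contig_line)
    return sum(int(end) - int(start) + 1 for start, end in spans)


def origin_sequence(text):
    """Concatenate the residues after ORIGIN, dropping position numbers."""
    _, _, after = text.partition("\nORIGIN")
    body = after.split("//")[0]  # stop at the end-of-record marker, if any
    return re.sub(r"[\s\d]", "", body)


contig_line = next(
    line for line in FLATFILE_TAIL.splitlines() if line.startswith("CONTIG")
)
seq = origin_sequence(FLATFILE_TAIL)

# True when the ORIGIN residues cover exactly the span named in CONTIG
print(contig_span_length(contig_line) == len(seq))
```

For this excerpt the two lengths agree, which is consistent with the announcement's statement that the sequence shown is taken from the WP protein record.
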
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
> 5005 Wateridge Vista Drive          voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:    http://www.accelrys.com
>
> http://www.linkedin.com/in/smarkel
> Secretary, Board of Directors:
>     International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLOS Computational Biology
> Editorial Board: Briefings in Bioinformatics