[Open-bio-l] [Bioperl-l] RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed
Peter Cock
p.j.a.cock at googlemail.com
Tue Jul 30 22:21:47 UTC 2013
Thanks Scott,
I fell foul of this change recently when testing Biopython on
GenPept format records from NCBI Entrez; I'd assumed it
might have been a short-term glitch:
https://github.com/biopython/biopython/commit/b6ccd0f05804c944f23e4e38df877e30761f491e
https://github.com/biopython/biopython/commit/f3a9e33e428a5ecfb490d4f6f0ede7695fcde0d2
CC'ing the cross project list in case BioRuby or BioJava
are also impacted.
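
For anyone checking their own parser, here is a minimal sketch in plain Python of reading a record tail that carries both sections. It is not the Biopython or BioPerl code; the function name read_contig_and_origin is hypothetical. It assumes ORIGIN is the last section before the "//" terminator and that CONTIG continuation lines are indented. The sample lines are the CONTIG/ORIGIN excerpt quoted below, written with the usual flat file indentation (also an assumption).

```python
import re


def read_contig_and_origin(lines):
    """Return (contig, sequence) from the tail of a GenBank/GenPept record.

    Either value is None if the corresponding section is absent.
    Hypothetical helper for illustration only.
    """
    contig_parts = []
    seq_parts = []
    section = None
    saw_origin = False
    for line in lines:
        if line.startswith("//"):
            break  # end of record
        if section == "ORIGIN":
            # Keep residues only; drop position numbers and spacing.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("CONTIG"):
            section = "CONTIG"
            contig_parts.append(line[len("CONTIG"):].strip())
        elif line.startswith("ORIGIN"):
            # Do not skip ORIGIN just because a CONTIG line was seen.
            section = "ORIGIN"
            saw_origin = True
        elif section == "CONTIG" and line[:1].isspace():
            contig_parts.append(line.strip())  # wrapped CONTIG line
        else:
            section = None  # some other keyword line
    contig = "".join(contig_parts) or None
    sequence = "".join(seq_parts) if saw_origin else None
    return contig, sequence


record = [
    "CONTIG      join(WP_015644991.1:1..273)",
    "ORIGIN      ",
    "        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn",
    "       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild",
    "      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna",
    "      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl",
    "      241 elslkndeif ykgavryigm svlgmgvfdr yfl",
    "//",
]

contig, sequence = read_contig_and_origin(record)
print(contig, len(sequence))
```

The point of the sketch is only that the two sections are collected independently, so the presence of CONTIG never suppresses the residues after ORIGIN.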
Regards,
Peter
On Tue, Jul 30, 2013 at 10:18 PM, Scott Markel
<Scott.Markel at accelrys.com> wrote:
> According to today's "[Refseq-announce] Post-release 60: human supplemental files & bacterial record format" both CONTIG and ORIGIN are now allowed in a GenBank-formatted entry. See below (*) or the second bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html for details.
>
> This change breaks Bio::SeqIO::genbank in the sense that the existence of the CONTIG line means that the sequence data following ORIGIN will not be read and $seq->seq() will not return a sequence string. See lines 713-741 of Bio::SeqIO::genbank.
>
> Note that this is related to the "Protein Records without Sequence" thread (http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).
>
> Scott
>
> (*) Details on the change
>
> [3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.
>
> Under the new data model for bacterial proteins, a subset of records continue to provide an organism-oriented package of protein records. These records use traditional RefSeq accession prefixes (NP, YP) and include a pointer to the identical non-redundant WP protein record. Those NP and YP records that have been updated to refer to a non-redundant WP protein record, such as YP_008335932.1, include the following flat file display details:
>
> * Genome Annotation Data structured comment is also displayed on protein records for the subset of bacterial genomes that have gone through the updated NCBI prokaryotic annotation pipeline.
> * Records include both a CONTIG line, which refers to the non-redundant WP protein accession, and also an ORIGIN with the sequence residues following. The sequence shown is from the WP protein record.
>
> CONTIG      join(WP_015644991.1:1..273)
> ORIGIN
>         1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
>        61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
>       121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
>       181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
>       241 elslkndeif ykgavryigm svlgmgvfdr yfl
>
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect email: smarkel at accelrys.com
> Accelrys (Pipeline Pilot R&D) mobile: +1 858 205 3653
> 5005 Wateridge Vista Drive voice: +1 858 799 5603
> San Diego, CA 92121 fax: +1 858 799 5222
> USA web: http://www.accelrys.com
>
> http://www.linkedin.com/in/smarkel
> Secretary, Board of Directors:
> International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLOS Computational Biology
> Editorial Board: Briefings in Bioinformatics
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l