[Bioperl-l] FW: GenBank Release 123.0 Now Available
Malcolm Cook
Malcolm.Cook@ppgx.com
Fri, 27 Apr 2001 11:06:08 -0700
I just read the following post in bionet.molbio.genbank and thought of
Bio::SeqIO::Genbank and Bio::DB::Genbank, since the format of the LOCUS
line is changing. Please excuse me if this is already obvious to the
appropriate module maintainers. I am curious what impact, if any, the
change will have on these modules.
Thanks,
malcolm.cook@ppgx.com
Greetings GenBank Users,
GenBank Release 123.0 is now available via ftp from the National Center
for Biotechnology Information:
Ftp Site          Directory  Contents
----------------  ---------  ---------------------------------------
ncbi.nlm.nih.gov  genbank    GenBank Release 123.0 flatfiles
                  ncbi-asn1  ASN.1 data used to create Release 123.0
Uncompressed, the Release 123.0 flatfiles require roughly 45214 MB
(sequence files only) or 50279 MB (including the 'index' files). The
ASN.1 version requires roughly 40404 MB. From the release notes:
Release  Date      Base Pairs   Entries
122      Feb 2001  11720120326  10896781
123      Apr 2001  12418544023  11545572
Close-of-data was 04/17/2001. Five business days were required to prepare
this release. In the eight-week period between close-of-data for GenBank
122.0 and GenBank 123.0, GenBank grew by 0.698 billion basepairs and
648,791 sequence records.
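The growth figures quoted above can be checked against the release table. This is only a small arithmetic sketch; the dictionary layout and variable names are illustrative and not part of the release notes.

```python
# Figures copied from the release table above.
releases = {
    "122": {"base_pairs": 11_720_120_326, "entries": 10_896_781},
    "123": {"base_pairs": 12_418_544_023, "entries": 11_545_572},
}

# Growth between the two releases.
bp_growth = releases["123"]["base_pairs"] - releases["122"]["base_pairs"]
entry_growth = releases["123"]["entries"] - releases["122"]["entries"]

print(f"{bp_growth / 1e9:.3f} billion basepairs")  # 0.698 billion basepairs
print(f"{entry_growth:,} sequence records")        # 648,791 sequence records
```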
We would like to remind our users that a GenBank mirror site is
available at ftp://genbank.sdsc.edu/pub . Please consider using this site
in order to speed up your transfer of GenBank releases.
For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 123.0 and Upcoming Changes) have been appended below.
**IMPORTANT**
One of the changes described in Section 1.4 is a redefinition of the
LOCUS line of the GenBank flatfile format, to be introduced in October
of 2001. Every record in GenBank will be affected. If you parse the LOCUS
line of the flatfile, please pay special attention to this upcoming change!
Release 123.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.
New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z),
containing only those entries new/updated since the Release 123.0
close-of-data, should be available by 07:00am EDT, April 25. Please note
that the new CUs will be smaller than previous versions you might have
obtained after Release 122.0 was posted.
If you encounter problems while ftp'ing or uncompressing Release 123.0,
please send email outlining your difficulties to info@ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev
GenBank
NCBI/NLM/NIH
1.3 Important Changes in Release 123.0
1.3.1 Headers missing for nine GSS files
Some manual processing of GSS division files was required in order to
correct a content problem. This has resulted in the lack of the standard
flatfile header (See Section 3.1) for nine of the GSS flatfiles:
gbgss1.seq.gz
gbgss2.seq.gz
gbgss3.seq.gz
gbgss4.seq.gz
gbgss5.seq.gz
gbgss6.seq.gz
gbgss19.seq.gz
gbgss36.seq.gz
gbgss37.seq.gz
1.3.2 Organizational changes
Due to database growth, the EST division is now being split into 111 pieces.
Due to database growth, the GSS division is now being split into 37 pieces.
Due to database growth, the INV division is now being split into 4 pieces.
Due to database growth, the PRI division is now being split into 10 pieces.
1.3.3 New HTC division introduced
A new GenBank division for unfinished high-throughput cDNA sequencing
(HTC) is now included in GenBank releases. HTC sequences may have 5'UTR
and 3'UTR at their ends, partial coding regions, and introns. A keyword of
"HTC" will be present, in addition to division code "HTC". Those HTC
sequences that undergo finishing (e.g., re-sequencing) will move to the
appropriate taxonomic GenBank division and the "HTC" keyword will be
removed. A recent project that generates HTC-quality data is described in:
Hayashizaki, Y.
Functional annotation of a full-length mouse cDNA collection
Nature 409, 685-690 (2001)
1.3.4 Minor change to REFERENCE line
The REFERENCE keyword for the literature citations associated with a
GenBank record has previously required a parenthetical component
indicating either the basepair span to which the citation applies, or
"sites" for citations providing annotation rather than sequence data.
Here are some examples:
REFERENCE   1  (bases 1 to 262290)
REFERENCE   2  (sites)
REFERENCE   3  (bases 1 to 456; bases 700 to 2334)
As of GenBank Release 123.0 (April 2001), this component of the REFERENCE
line has been made optional, to simplify submissions involving a large
number of sequence changes when the submitter is unable to identify all
the relevant basepair spans.
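For anyone parsing REFERENCE lines, the practical consequence is that the parenthetical part must now be treated as optional. The following is a minimal sketch of such a parser; the function name and return shape are hypothetical, and this is not how Bio::SeqIO or any NCBI tool does it.

```python
import re

# REFERENCE, a citation number, then an optional "( ... )" component.
REFERENCE_RE = re.compile(r"^REFERENCE\s+(\d+)(?:\s+\((.*)\))?\s*$")

def parse_reference(line):
    """Return (number, kind, spans); kind is 'bases', 'sites', or None."""
    match = REFERENCE_RE.match(line)
    if match is None:
        raise ValueError(f"not a REFERENCE line: {line!r}")
    number = int(match.group(1))
    detail = match.group(2)
    if detail is None:
        # Release 123.0 and later: the parenthetical may be absent.
        return number, None, []
    if detail == "sites":
        return number, "sites", []
    # One or more "bases M to N" spans, separated by semicolons.
    spans = [(int(a), int(b))
             for a, b in re.findall(r"bases (\d+) to (\d+)", detail)]
    return number, "bases", spans

print(parse_reference("REFERENCE   3  (bases 1 to 456; bases 700 to 2334)"))
print(parse_reference("REFERENCE   2  (sites)"))
print(parse_reference("REFERENCE   1"))
```

A parser that insists on finding the parenthesis would reject the last form, which is valid as of this release.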
Users interested in the details of how a sequence has changed can use
NCBI's Blast-2-Sequences tool:
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
1.4 Upcoming Changes
1.4.1 LOCUS line format change: to accommodate longer names and sequences
When the LOCUS line format for the GenBank flatfile was designed nearly
two decades ago, sequences over 10 Mbp in length were not anticipated. As
a result, the maximum length of a LOCUS name is nine characters, and the
maximum length of a sequence is 9,999,999 bases:
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
LOCUS       AB000383     5423 bp    DNA   circular  VRL       05-FEB-1999
Positions  Contents
---------  --------
01-05      LOCUS
06-12      spaces
13-21      Locus name
22-22      space
23-29      Length of sequence, right-justified
31-32      bp
34-36      Blank, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
37-42      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA
43-52      Blank (implies linear) or circular
53-55      The division code (see Section 3.3)
63-73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
This leads to several problems: a) meaningful names of more than nine
characters cannot be utilized; b) the nine-character limit causes LOCUS
names to be truncated for many segmented sets of more than ten members
(see AF272557, AF272558, etc); c) invalid LOCUS lines result when the
GenBank flatfile format is used to display other types of sequence data.
For (c), consider human contig Hs22_11677 derived from the sequences in
the HTG division of GenBank:
LOCUS       Hs22_1167722998459 bp    DNA             PRI       10-FEB-2001
DEFINITION  Homo sapiens chromosome 22 working draft sequence segment.
ACCESSION   NT_011520
The LOCUS name (Hs22_11677) collides with the sequence length (22998459)
due to the restrictions of the LOCUS line format.
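To see why a fixed-column reader misreads such a record, here is a rough sketch that slices a LOCUS line at the old column positions. The column ranges come from the table above; `write_old_locus` is a hypothetical writer that pads fields but never truncates them, which is only an assumption about how the colliding line was produced.

```python
# Column ranges (1-based, inclusive) copied from the old-format table.
OLD_COLUMNS = {
    "name": (13, 21),
    "length": (23, 29),
    "molecule": (37, 42),
    "topology": (43, 52),
    "division": (53, 55),
    "date": (63, 73),
}

def slice_fields(line, columns):
    """Cut a LOCUS line into fields by fixed column positions."""
    return {key: line[start - 1:end].strip()
            for key, (start, end) in columns.items()}

def write_old_locus(name, length, molecule, topology, division, date):
    """Hypothetical writer: pads each field to its width, never truncates."""
    return (
        "LOCUS" + " " * 7
        + f"{name:<10}{length:>7}"  # name 13-21 plus the space at 22, length 23-29
        + " bp" + " " * 4           # columns 30-36, strand field left blank
        + f"{molecule:<6}{topology:<10}{division:<3}"
        + " " * 7 + date
    )

ok = slice_fields(
    write_old_locus("AB000383", 5423, "DNA", "circular", "VRL", "05-FEB-1999"),
    OLD_COLUMNS)
bad = slice_fields(
    write_old_locus("Hs22_11677", 22998459, "DNA", "", "PRI", "10-FEB-2001"),
    OLD_COLUMNS)

print(ok["name"], ok["length"])    # AB000383 5423
print(bad["name"], bad["length"])  # Hs22_1167 2299845
```

In the second case both the name and the length come back truncated, and every later field is shifted by one column.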
To address the LOCUS problems, a new LOCUS line format which allows names
of up to 18 characters and sequences of up to 99,999,999,999 bases will be
utilized for all GenBank records starting with Release 126.0 in October
2001:
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
LOCUS       18_Char_LOCUS_Name 99999999999 bp ss-snRNA circular DIV DD-MMM-YYYY
Positions  Contents
---------  --------
01-05      LOCUS
06-12      spaces
13-30      Locus name
31-31      space
32-42      Length of sequence, right-justified
43-43      space
44-45      bp
46-46      space
47-49      Blank, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
50-54      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA
55-55      space
56-63      Blank (implies linear) or circular
64-64      space
65-67      The division code (see Section 3.3)
68-68      space
69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
Here's how an existing record would appear using this new format:
LOCUS       AB000383                  5423 bp    DNA   circular VRL 05-FEB-1999
DEFINITION  Leucania seperata nuclear polyhedrosis virus DNA for p13, xe,
            envelope protein, complete cds.
ACCESSION   AB000383
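A parser for the new layout could slice on the new column positions in the same way. The sketch below is only an illustration built from the position table above; the function names are hypothetical, and real code would also need to recognize old-format lines until Release 126.0.

```python
# Column ranges (1-based, inclusive) copied from the new-format table.
NEW_COLUMNS = {
    "name": (13, 30),
    "length": (32, 42),
    "strand": (47, 49),
    "molecule": (50, 54),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}

def format_new_locus(name, length, strand, molecule, topology, division, date):
    """Lay the fields out at the new fixed column positions."""
    return (f"{'LOCUS':<12}{name:<18} {length:>11} bp "
            f"{strand:<3}{molecule:<5} {topology:<8} {division:<3} {date}")

def parse_new_locus(line):
    """Slice a new-format LOCUS line into a dict of fields."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    fields = {key: line[start - 1:end].strip()
              for key, (start, end) in NEW_COLUMNS.items()}
    fields["length"] = int(fields["length"])
    # A blank topology field implies a linear molecule.
    fields["topology"] = fields["topology"] or "linear"
    return fields

line = format_new_locus("AB000383", 5423, "", "DNA", "circular",
                        "VRL", "05-FEB-1999")
record = parse_new_locus(line)
print(len(line))                         # 79
print(record["name"], record["length"])  # AB000383 5423

# The contig that collided in the old format now fits.
contig = parse_new_locus(format_new_locus(
    "Hs22_11677", 22998459, "", "DNA", "", "PRI", "10-FEB-2001"))
print(contig["name"], contig["length"])  # Hs22_11677 22998459
```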
Sample GenBank flatfiles with the new LOCUS line format will be made
available after Releases 124.0 (June) and 125.0 (August), so that developers
can test software that parses GenBank flatfiles. Further announcements about
the LOCUS line change will be made via these release notes and the GenBank
newsgroup (bionet.molbio.genbank).
1.4.2 NCBI's ftp address will be changed
At some point in the near future NCBI's ftp address will be changed.
The current address:
ncbi.nlm.nih.gov
will become:
ftp.ncbi.nih.gov
Additional details about this change will be made available via these
release notes and the GenBank newsgroup (bionet.molbio.genbank) as they
become available.
1.4.3 Selenocysteine representation
Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.
DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.
Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.
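Software that validates protein translations against a fixed alphabet would need to accept the new letter. As a rough illustration only, with a made-up peptide and hypothetical names:

```python
# The 20 standard one-letter amino acid codes plus 'X' (unknown residue).
LEGACY_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}
# The same alphabet extended with 'U' for selenocysteine.
EXTENDED_ALPHABET = LEGACY_ALPHABET | {"U"}

def unexpected_residues(translation, alphabet):
    """Return the sorted letters in `translation` that are not in `alphabet`."""
    return sorted(set(translation.upper()) - alphabet)

example = "MKTUGCA"  # made-up peptide, not taken from any GenBank record
print(unexpected_residues(example, LEGACY_ALPHABET))    # ['U']
print(unexpected_residues(example, EXTENDED_ALPHABET))  # []
```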
1.4.4 New REFERENCE type for on-line journals
Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:
REFERENCE   1  (bases 1 to 2858)
  AUTHORS   Smith, J.
  TITLE     Cloning and expression of a phospholipase gene
  JOURNAL   Online Publication
  REMARK    Online-Journal-name; Article Identifier; URL
This format is still tentative; additional information about this new
reference type will be made available via these release notes.