[Bioperl-l] BioSQL: loading large sequence records, and taxon parsing
Xiaoying Lin
xylin00 at yahoo.com
Tue Jun 17 19:12:10 EDT 2003
Sorry, my palm touched the touchpad and sent the e-mail by accident.
Hi, I have three questions related to BioSQL (using the latest CVS checkout
and BioPerl 1.2.1):
1. I am wondering if anyone has tried to load a large sequence (like a whole
chromosome with annotation). It took me overnight to load a 20 Mb sequence
with about 4000 genes' worth of annotation, on a P-III 750 MHz laptop with
250 MB of memory.
Is there any way to make this faster, besides buying a faster machine? ;-)
2. In the taxon table, there is a column 'mito_genetic_code'.
Have people thought about a genetic code column for plastid genomes, such as
the chloroplast?
3. I encountered a problem that may be related to how the taxon_name table is
populated by load_seqdatabase.pl (or the modules it calls). I loaded the
database with two organelle genomes, the mitochondrion and the chloroplast,
using the two records appended at the end of this message, in that order.
Though both records show up in the bioentry table, it seems only the info
from the first record was populated into the taxon_name table:
 taxon_id |                name                |   name_class
----------+------------------------------------+-----------------
        1 | Eukaryota                          | scientific name
        2 | Viridiplantae                      | scientific name
.......... extra lines removed ...................
       13 | Brassicaceae                       | scientific name
       14 | Arabidopsis                        | scientific name
       15 | Mitochondrion                      | scientific name
       16 | Mitochondrion Arabidopsis          | scientific name
       17 | Mitochondrion Arabidopsis thaliana | scientific name
       17 | thale cress                        | common name
(18 rows)
This could also be a peculiar data issue with the GenBank records: both carry
the same taxon ID (db_xref taxon:3702), although they are different genomes.
I would just like to hear your thoughts on this.
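To make the suspected collision concrete, here is a minimal sketch in Python
(not BioPerl, and not the actual load_seqdatabase.pl logic). It assumes,
without my having verified it, that the loader keys taxa on the NCBI taxon ID
taken from /db_xref. The helper names organism_and_taxon and group_by_taxon
are hypothetical, and the record fragments are abbreviated from the two files
quoted below.

```python
import re
from collections import defaultdict

# Abbreviated fragments of the two GenBank records quoted in this message.
RECORDS = {
    "NC_001284": '''
SOURCE      thale cress.
  ORGANISM  Mitochondrion Arabidopsis thaliana
     source          1..366923
                     /organism="Arabidopsis thaliana"
                     /organelle="mitochondrion"
                     /db_xref="taxon:3702"
''',
    "NC_000932": '''
SOURCE      thale cress.
  ORGANISM  Chloroplast Arabidopsis thaliana
     source          1..154478
                     /organism="Arabidopsis thaliana"
                     /organelle="plastid:chloroplast"
                     /db_xref="taxon:3702"
''',
}


def organism_and_taxon(text):
    """Return (ORGANISM line value, NCBI taxon ID) from a GenBank fragment."""
    organism = re.search(r"^\s*ORGANISM\s+(.+)$", text, re.MULTILINE).group(1)
    taxon = int(re.search(r'/db_xref="taxon:(\d+)"', text).group(1))
    return organism.strip(), taxon


def group_by_taxon(records):
    """Group (accession, ORGANISM name) pairs by taxon ID.

    ASSUMPTION: this mimics a loader that uses the taxon ID as the key;
    the real loader may behave differently.
    """
    grouped = defaultdict(list)
    for accession, text in records.items():
        organism, taxon = organism_and_taxon(text)
        grouped[taxon].append((accession, organism))
    return dict(grouped)


grouped = group_by_taxon(RECORDS)
for taxon, entries in grouped.items():
    print(taxon, entries)
```

Under that assumption both records fall under the single taxon 3702 even
though their ORGANISM lines differ, so a loader that stores names only the
first time it sees a taxon ID would keep just the mitochondrial name.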
Thanks in advance.
-Xiaoying
==> /data/A_thaliana/At.mito.gb <==
LOCUS NC_001284 366923 bp DNA linear PLN 27-MAR-2001
DEFINITION Arabidopsis thaliana mitochondrion, complete genome.
ACCESSION NC_001284
VERSION NC_001284.1 GI:13449290
KEYWORDS .
SOURCE thale cress.
ORGANISM Mitochondrion Arabidopsis thaliana
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis.
REFERENCE 1 (bases 1 to 366923)
AUTHORS Marienfeld,J., Unseld,M., Brandt,P. and Brennicke,A.
TITLE Genomic recombination of the mitochondrial atp6 gene in Arabidopsis
thaliana at the protein processing site creates two different
presequences
JOURNAL DNA Res. 3 (5), 287-290 (1996)
MEDLINE 97191539
REFERENCE 2 (bases 1 to 366923)
AUTHORS Giege,P. and Brennicke,A.
TITLE RNA editing in Arabidopsis mitochondria effects 441 C to U changes
in ORFs
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 96 (26), 15324-15329 (1999)
MEDLINE 20079652
REFERENCE 3 (bases 1 to 366923)
AUTHORS Unseld,M., Marienfeld,J.R., Brandt,P. and Brennicke,A.
TITLE The mitochondrial genome of Arabidopsis thaliana contains 57 genes
in 366924 nucleotides
JOURNAL Nat. Genet.
REFERENCE 4 (bases 1 to 366923)
AUTHORS Marienfeld,J.R.
TITLE Direct Submission
JOURNAL Submitted (30-SEP-1996) J.R. Marienfeld, Institut fuer
Genbiologische Forschung GmbH, Ihnestrasse 63, 14195 Berlin, FRG
FEATURES Location/Qualifiers
source 1..366923
/organism="Arabidopsis thaliana"
/organelle="mitochondrion"
/variety="Columbia"
/db_xref="taxon:3702"
/sub_clone="pUC19"
==> /data/A_thaliana/At.chl.gb <==
LOCUS NC_000932 154478 bp DNA circular PLN 03-APR-2000
DEFINITION Arabidopsis thaliana chloroplast, complete genome.
ACCESSION NC_000932
VERSION NC_000932.1 GI:7525012
KEYWORDS .
SOURCE thale cress.
ORGANISM Chloroplast Arabidopsis thaliana
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis.
REFERENCE 1 (sites)
AUTHORS Sato,S., Nakamura,Y., Kaneko,T., Asamizu,E. and Tabata,S.
TITLE Complete structure of the chloroplast genome of Arabidopsis
thaliana
JOURNAL DNA Res. 6 (5), 283-290 (1999)
MEDLINE 20039611
PUBMED 10574454
REFERENCE 2 (bases 1 to 154478)
AUTHORS Nakamura,Y.
TITLE Direct Submission
JOURNAL Submitted (09-SEP-1999) Yasukazu Nakamura, Kazusa DNA Research
Institute, Laboratory of Gene Structure 2; Yana 1532-3, Kisarazu,
Chiba 292-0812, Japan (E-mail:ynakamu at kazusa.or.jp,
URL:http://www.kazusa.or.jp/gene-s2/, Tel:81-438-52-3935,
Fax:81-438-52-3934)
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from AP000423.
FEATURES Location/Qualifiers
source 1..154478
/organism="Arabidopsis thaliana"
/organelle="plastid:chloroplast"
/strain="Columbia"
/db_xref="taxon:3702"