[Bioperl-l] BioSQL: loading large sequence records, and taxon parsing

Xiaoying Lin xylin00 at yahoo.com
Tue Jun 17 19:12:10 EDT 2003


Sorry, my palm touched the touchpad and sent the e-mail by accident.


Hi, I have three questions about BioSQL (latest CVS checkout, with BioPerl
1.2.1):


1. Has anyone tried to load a large sequence (such as a whole chromosome with
annotation)? It took me overnight to load a 20 Mb sequence with annotation for
some 4000 genes, on a laptop with a 750 MHz P-III and 250 MB of memory.
Is there any way to make this faster, besides buying a faster machine? ;-)

2. The taxon table has a column 'mito_genetic_code'.
Has anyone thought about the genetic code for plastid genomes, such as the
chloroplast?


3. I ran into a problem that may be related to how the taxon_name table is
populated by load_seqdatabase.pl (or the modules it calls). I loaded the
database with two organelle genomes, the mitochondrion and the chloroplast,
using the two records quoted at the end of this message, in that order. Both
records show up in the bioentry table, but it seems that only the information
from the first record was written to the taxon_name table:

 taxon_id |                name                |   name_class
----------+------------------------------------+-----------------
        1 | Eukaryota                          | scientific name
        2 | Viridiplantae                      | scientific name
.......... extra lines removed ...................
       13 | Brassicaceae                       | scientific name
       14 | Arabidopsis                        | scientific name
       15 | Mitochondrion                      | scientific name
       16 | Mitochondrion Arabidopsis          | scientific name
       17 | Mitochondrion Arabidopsis thaliana | scientific name
       17 | thale cress                        | common name
(18 rows)

This could also be a peculiarity of the GenBank records themselves: both get
the same taxon_id, although they are different genomes. I would just like to
hear your thoughts on this.
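
To make the overlap concrete, here is a minimal sketch (Python, standard
library only) that pulls the ORGANISM line and the taxon db_xref out of
abbreviated copies of the two record headers quoted at the end of this
message. The function names are my own and purely illustrative; this is not
how load_seqdatabase.pl or the BioPerl modules work internally, only a way to
show that the two ORGANISM names differ while the taxon ID is shared.

```python
import re

# Abbreviated copies of the two record headers (lineage lines omitted).
MITO = '''\
SOURCE      thale cress.
  ORGANISM  Mitochondrion Arabidopsis thaliana
FEATURES             Location/Qualifiers
     source          1..366923
                     /organism="Arabidopsis thaliana"
                     /organelle="mitochondrion"
                     /db_xref="taxon:3702"
'''

CHLORO = '''\
SOURCE      thale cress.
  ORGANISM  Chloroplast Arabidopsis thaliana
FEATURES             Location/Qualifiers
     source          1..154478
                     /organism="Arabidopsis thaliana"
                     /organelle="plastid:chloroplast"
                     /db_xref="taxon:3702"
'''


def organism_and_taxon(record_text):
    """Return (ORGANISM line text, NCBI taxon ID) for one GenBank record.

    Hypothetical helper for illustration; assumes both fields are present.
    """
    organism = re.search(r"^[ \t]*ORGANISM[ \t]+(.+?)[ \t]*$",
                         record_text, re.MULTILINE)
    taxon = re.search(r'/db_xref="taxon:(\d+)"', record_text)
    return organism.group(1), int(taxon.group(1))


def group_by_taxon(records):
    """Map each taxon ID to the ORGANISM names seen for it, in load order."""
    groups = {}
    for text in records:
        name, taxon_id = organism_and_taxon(text)
        groups.setdefault(taxon_id, []).append(name)
    return groups


groups = group_by_taxon([MITO, CHLORO])
for taxon_id, names in groups.items():
    print(taxon_id, names)
```

Both ORGANISM names end up under the single key 3702, so a loader that looks
up an existing taxon by that ID alone would find the first record's entry
already present when it reaches the second record. Whether that is what
happens here is my guess, not something I have confirmed in the code.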

Thanks in advance.


-Xiaoying





==> /data/A_thaliana/At.mito.gb <==
LOCUS       NC_001284             366923 bp    DNA     linear   PLN 27-MAR-2001
DEFINITION  Arabidopsis thaliana mitochondrion, complete genome.
ACCESSION   NC_001284
VERSION     NC_001284.1  GI:13449290
KEYWORDS    .
SOURCE      thale cress.
  ORGANISM  Mitochondrion Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; 
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis.
REFERENCE   1  (bases 1 to 366923)
  AUTHORS   Marienfeld,J., Unseld,M., Brandt,P. and Brennicke,A.
  TITLE     Genomic recombination of the mitochondrial atp6 gene in Arabidopsis
            thaliana at the protein processing site creates two different
            presequences
  JOURNAL   DNA Res. 3 (5), 287-290 (1996)
  MEDLINE   97191539
REFERENCE   2  (bases 1 to 366923)
  AUTHORS   Giege,P. and Brennicke,A.
  TITLE     RNA editing in Arabidopsis mitochondria effects 441 C to U changes 
            in ORFs
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 96 (26), 15324-15329 (1999)
  MEDLINE   20079652
REFERENCE   3  (bases 1 to 366923)
  AUTHORS   Unseld,M., Marienfeld,J.R., Brandt,P. and Brennicke,A.
  TITLE     The mitochondrial genome of Arabidopsis thaliana contains 57 genes 
            in 366924 nucleotides
  JOURNAL   Nat. Genet.
REFERENCE   4  (bases 1 to 366923)
  AUTHORS   Marienfeld,J.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-SEP-1996) J.R. Marienfeld, Institut fuer
            Genbiologische Forschung GmbH, Ihnestrasse 63, 14195 Berlin, FRG
FEATURES             Location/Qualifiers
     source          1..366923
                     /organism="Arabidopsis thaliana"
                     /organelle="mitochondrion"
                     /variety="Columbia"
                     /db_xref="taxon:3702"
                     /sub_clone="pUC19"

==> /data/A_thaliana/At.chl.gb <==
LOCUS       NC_000932             154478 bp    DNA     circular PLN 03-APR-2000
DEFINITION  Arabidopsis thaliana chloroplast, complete genome.
ACCESSION   NC_000932
VERSION     NC_000932.1  GI:7525012
KEYWORDS    .
SOURCE      thale cress.
  ORGANISM  Chloroplast Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; 
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis.
REFERENCE   1  (sites)
  AUTHORS   Sato,S., Nakamura,Y., Kaneko,T., Asamizu,E. and Tabata,S.
  TITLE     Complete structure of the chloroplast genome of Arabidopsis
            thaliana
  JOURNAL   DNA Res. 6 (5), 283-290 (1999)
  MEDLINE   20039611
   PUBMED   10574454
REFERENCE   2  (bases 1 to 154478)
  AUTHORS   Nakamura,Y.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-SEP-1999) Yasukazu Nakamura, Kazusa DNA Research
            Institute, Laboratory of Gene Structure 2; Yana 1532-3, Kisarazu,
            Chiba 292-0812, Japan (E-mail:ynakamu at kazusa.or.jp,
            URL:http://www.kazusa.or.jp/gene-s2/, Tel:81-438-52-3935,
            Fax:81-438-52-3934)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AP000423.
FEATURES             Location/Qualifiers
     source          1..154478
                     /organism="Arabidopsis thaliana"
                     /organelle="plastid:chloroplast"
                     /strain="Columbia"
                     /db_xref="taxon:3702"


