[Bioperl-l] BioSQL: loading large sequence records, and taxon parsing
Lin, Xiaoying
Xiaoying.Lin at celera.com
Fri Jun 20 09:59:50 EDT 2003
> From: Juguang Xiao [mailto:juguang at tll.org.sg]
> > 1. I am wondering if anyone has tried to load a large
> > sequence (like a
>
> I have loaded the whole swissprot, embl and trembl dataset
> into biosql
> in mysql. It is not as fast as we expected but endurable. :-)
Did you include the genome data from GenBank, where one record is a
whole chromosome?
It took 6 hours to load a single chromosome (20 Mb) on my laptop before
anything was written to disk. I guess most of the time was spent
creating objects.
Thanks.
-Xiaoying
>
> >
> > 3. The problem I encountered that may be related to how the
> taxon_name
> > table is
> > populated by the load_seqdatabase.pl (or modules called
> by). I loaded
> > the
> > database with 2 organelle genomes the mito and the chloroplast with
> > following
> > two records in that order. Though both records show up in the
> > bioentry table,
> > it seems only the info from the first record got populated into the
> > taxon_name
> > table:
> >
> > taxon_id | name | name_class
> > ----------+------------------------------------+-----------------
> > 1 | Eukaryota | scientific name
> > 2 | Viridiplantae | scientific name
> > .......... extra lines removed ...................
> > 13 | Brassicaceae | scientific name
> > 14 | Arabidopsis | scientific name
> > 15 | Mitochondrion | scientific name
> > 16 | Mitochondrion Arabidopsis | scientific name
> > 17 | Mitochondrion Arabidopsis thaliana | scientific name
> > 17 | thale cress | common name
> > (18 rows)
>
> To be honest, I do not care about it, as long as you can fetch the
> result out correctly. I actually met such case before. One
> way to solve
> it is to load_ncbi_taxonomy before load your sequence. (That may be
> unnecessary in your case)
This is something we have to be aware of before querying by name: a
query such as name='Arabidopsis thaliana' won't return anything. The
NCBI taxon_id may be safer to use. Preloading the taxonomy is probably
the only way to deal with this for a large data set like the whole of
EMBL.
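
A minimal sketch of the difference, using Python's sqlite3 in place of the
real MySQL/PostgreSQL database. The taxon_name rows are copied from the
listing above (only the rows that were shown). The taxon table, its
ncbi_taxon_id column, and the link from local taxon 17 to NCBI taxon 3702
(Arabidopsis thaliana) are assumptions for illustration. They are not taken
from the actual load, and the helper function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER);
CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
""")

# Rows as they appeared in taxon_name after the load (subset shown in the post).
conn.executemany(
    "INSERT INTO taxon_name VALUES (?, ?, ?)",
    [
        (14, "Arabidopsis", "scientific name"),
        (15, "Mitochondrion", "scientific name"),
        (16, "Mitochondrion Arabidopsis", "scientific name"),
        (17, "Mitochondrion Arabidopsis thaliana", "scientific name"),
        (17, "thale cress", "common name"),
    ],
)

# Assumption: local taxon 17 carries NCBI taxon ID 3702.
conn.execute("INSERT INTO taxon VALUES (17, 3702)")


def ids_by_name(name):
    """Return local taxon IDs whose name matches exactly."""
    rows = conn.execute(
        "SELECT DISTINCT taxon_id FROM taxon_name WHERE name = ?", (name,)
    )
    return [row[0] for row in rows]


def names_by_ncbi_id(ncbi_id):
    """Return (name, name_class) pairs for an NCBI taxon ID."""
    rows = conn.execute(
        """SELECT n.name, n.name_class
           FROM taxon t JOIN taxon_name n ON n.taxon_id = t.taxon_id
           WHERE t.ncbi_taxon_id = ?
           ORDER BY n.name""",
        (ncbi_id,),
    )
    return rows.fetchall()


# The exact-name query finds nothing, because the stored name is prefixed.
print(ids_by_name("Arabidopsis thaliana"))
# The ID-based query does not depend on how the name was parsed.
print(names_by_ncbi_id(3702))
```

The name lookup comes back empty because the stored scientific name is
'Mitochondrion Arabidopsis thaliana', while the ID lookup still reaches both
names for that taxon.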
-Xiaoying