[Bioperl-l] BioSQL: loading large sequence records, and taxon parsing

Fri Jun 20 09:59:50 EDT 2003

> From: Juguang Xiao [mailto:juguang at tll.org.sg] 
> > 1. I am wondering if anyone has tried to load a large 
> > sequence (like a
> 
> I have loaded the whole swissprot, embl and trembl dataset into biosql
> in mysql. It is not as fast as we expected but endurable. :-)

Did you include the genome data from GenBank, where one record is a
whole chromosome?

It took 6 hours to load a single chromosome (20 Mb) on my laptop, and
nothing happened on the disk until then, so I guess most of the time
was spent creating objects.

Thanks.

-Xiaoying

> 
> >
> > 3. The problem I encountered that may be related to how the taxon_name
> > table is populated by the load_seqdatabase.pl (or modules called by). I
> > loaded the database with 2 organelle genomes the mito and the chloroplast
> > with following two records in that order.  Though both records show up in
> > the bioentry table, it seems only the info from the first record got
> > populated into the taxon_name table:
> >
> > taxon_id |                name                |   name_class
> > ----------+------------------------------------+-----------------
> >         1 | Eukaryota                          | scientific name
> >         2 | Viridiplantae                      | scientific name
> > .......... extra lines removed ...................
> >        13 | Brassicaceae                       | scientific name
> >        14 | Arabidopsis                        | scientific name
> >        15 | Mitochondrion                      | scientific name
> >        16 | Mitochondrion Arabidopsis          | scientific name
> >        17 | Mitochondrion Arabidopsis thaliana | scientific name
> >        17 | thale cress                        | common name
> > (18 rows)
> 
> To be honest, I do not care about it, as long as you can fetch the
> result out correctly. I actually met such case before. One way to solve
> it is to load_ncbi_taxonomy before load your sequence. (That may be
> unnecessary in your case)

This is something we have to be aware of before querying by name: a
query of the type name='Arabidopsis thaliana' won't return anything,
because the only scientific name stored for that taxon is
'Mitochondrion Arabidopsis thaliana'. The NCBI taxon ID may be safer
to use. Preloading the taxonomy is probably the only way to deal with
this for a large data set like the whole of EMBL.
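
To make the point concrete, here is a minimal sketch in Python with an
in-memory SQLite database. It is not the real BioSQL schema: the two
tables are cut down to the columns needed here, the rows are copied
from the taxon_name listing quoted above, and I am assuming that the
loader stored NCBI taxon ID 3702 (Arabidopsis thaliana) on row 17. The
helper names find_by_name and find_by_ncbi_id are made up for this
example.

```python
import sqlite3

# Simplified stand-in for the BioSQL taxon tables (assumption: only the
# columns needed for this illustration are present).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER);
CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
""")

# Rows 14-17 from the listing quoted above. The value 3702 on row 17 is
# an assumption about what the loader stored, not taken from the listing.
conn.executemany(
    "INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (?, ?)",
    [(14, None), (15, None), (16, None), (17, 3702)],
)
conn.executemany(
    "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
    [
        (14, "Arabidopsis", "scientific name"),
        (15, "Mitochondrion", "scientific name"),
        (16, "Mitochondrion Arabidopsis", "scientific name"),
        (17, "Mitochondrion Arabidopsis thaliana", "scientific name"),
        (17, "thale cress", "common name"),
    ],
)


def find_by_name(conn, name):
    """Return taxon_ids whose scientific name matches exactly."""
    rows = conn.execute(
        "SELECT taxon_id FROM taxon_name "
        "WHERE name = ? AND name_class = 'scientific name'",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


def find_by_ncbi_id(conn, ncbi_taxon_id):
    """Return taxon_ids carrying the given NCBI taxon ID."""
    rows = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(find_by_name(conn, "Arabidopsis thaliana"))  # no exact match stored
print(find_by_ncbi_id(conn, 3702))                 # finds row 17
```

With these rows the name lookup comes back empty while the lookup by
NCBI taxon ID finds the record, which is why the ID looks like the
safer key to me.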

-Xiaoying