[Bioperl-l] BioSQL: loading large sequence records, and taxon parsing

Fri Jun 20 14:31:45 EDT 2003

> -----Original Message-----
> From: Lin, Xiaoying [mailto:Xiaoying.Lin at celera.com] 
> Sent: Friday, June 20, 2003 6:00 AM
> To: Juguang Xiao
> Cc: bioperl-l at bioperl.org
> Subject: RE: [Bioperl-l] BioSQL: loading large sequence
> records, and taxon parsing
> 
> 
> 
> > From: Juguang Xiao [mailto:juguang at tll.org.sg]
> > > 1. I am wondering if anyone has tried to load a large
> > > sequence (like a
> > 
> > I have loaded the whole swissprot, embl and trembl dataset
> > into biosql 
> > in mysql. It is not as fast as we expected but endurable. :-)
> 
> Did you include the genome data from GenBank, where 1 record
> is a whole chromosome?
>
> It took 6 hrs to load a single chromosome (20Mb) on my laptop
> before anything happened on the disk. I guess most of the time
> was spent on creating objects.

I suspect it was spent on disk thrashing because you have too little
memory on the machine. Try it on a machine with 1GB or more.

	-hilmar

> 
> Thanks.
> 
> -Xiaoying
> 
> > 
> > >
> > > 3. The problem I encountered may be related to how the taxon_name
> > > table is populated by load_seqdatabase.pl (or the modules it calls).
> > > I loaded the database with 2 organelle genomes, the mito and the
> > > chloroplast, with the following two records in that order. Though
> > > both records show up in the bioentry table, it seems only the info
> > > from the first record got populated into the taxon_name table:
> > >
> > > taxon_id |                name                |   name_class
> > > ----------+------------------------------------+-----------------
> > >         1 | Eukaryota                          | scientific name
> > >         2 | Viridiplantae                      | scientific name
> > > .......... extra lines removed ...................
> > >        13 | Brassicaceae                       | scientific name
> > >        14 | Arabidopsis                        | scientific name
> > >        15 | Mitochondrion                      | scientific name
> > >        16 | Mitochondrion Arabidopsis          | scientific name
> > >        17 | Mitochondrion Arabidopsis thaliana | scientific name
> > >        17 | thale cress                        | common name
> > > (18 rows)
> > 
> > To be honest, I do not care about it, as long as you can fetch the
> > result out correctly. I actually met such a case before. One way to
> > solve it is to run load_ncbi_taxonomy before loading your sequences.
> > (That may be unnecessary in your case.)
> 
> This is something we have to be aware of before querying by
> 'name': one won't get anything from a name='Arabidopsis thaliana'
> type of query. taxon_id (ncbi) may be safer to use.
> Preloading the taxonomy is probably the only way to deal
> with this if it is a large data set like the whole EMBL....
> 
> 
> -Xiaoying
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
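
To make the point about querying by name concrete, here is a minimal
sketch in Python using an in-memory SQLite table. It is only an
illustration: the three-column table is a simplified stand-in for
taxon_name, not the real BioSQL schema, the rows are the ones shown in
the listing quoted above (the elided rows are left out), and the helper
names ids_by_name and names_by_id are made up for this example. The
taxon_id used here is the local id from that listing; in a real
database one would presumably key on the NCBI taxon id instead.

```python
import sqlite3

# Rows copied from the taxon_name listing quoted in the thread.
# The rows elided in the original post are not reconstructed here.
ROWS = [
    (1, "Eukaryota", "scientific name"),
    (2, "Viridiplantae", "scientific name"),
    (13, "Brassicaceae", "scientific name"),
    (14, "Arabidopsis", "scientific name"),
    (15, "Mitochondrion", "scientific name"),
    (16, "Mitochondrion Arabidopsis", "scientific name"),
    (17, "Mitochondrion Arabidopsis thaliana", "scientific name"),
    (17, "thale cress", "common name"),
]


def build_demo_db():
    """Build a simplified, hypothetical stand-in for the taxon_name table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)"
    )
    conn.executemany("INSERT INTO taxon_name VALUES (?, ?, ?)", ROWS)
    return conn


def ids_by_name(conn, name):
    """Return the taxon ids whose name matches exactly."""
    cur = conn.execute(
        "SELECT DISTINCT taxon_id FROM taxon_name WHERE name = ? ORDER BY taxon_id",
        (name,),
    )
    return [row[0] for row in cur]


def names_by_id(conn, taxon_id):
    """Return every name stored for one taxon id."""
    cur = conn.execute(
        "SELECT name FROM taxon_name WHERE taxon_id = ? ORDER BY name",
        (taxon_id,),
    )
    return [row[0] for row in cur]


conn = build_demo_db()

# The exact-name query from the thread finds nothing, because the species
# was stored as 'Mitochondrion Arabidopsis thaliana'.
print(ids_by_name(conn, "Arabidopsis thaliana"))                # []
print(ids_by_name(conn, "Mitochondrion Arabidopsis thaliana"))  # [17]

# Looking up by id does not depend on how the name was parsed.
print(names_by_id(conn, 17))
```

The lookup by id keeps working however the loader spelled the name,
which is why keying on an id looks like the safer choice until the
taxonomy has been preloaded.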