[Bioperl-l] BioSQL: loading large sequence records, and taxon parsing
Hilmar Lapp
hlapp at gnf.org
Fri Jun 20 14:31:45 EDT 2003
> -----Original Message-----
> From: Lin, Xiaoying [mailto:Xiaoying.Lin at celera.com]
> Sent: Friday, June 20, 2003 6:00 AM
> To: Juguang Xiao
> Cc: bioperl-l at bioperl.org
> Subject: RE: [Bioperl-l] BioSQL: loading large sequence
> records,and taxon parsing
>
>
>
> > From: Juguang Xiao [mailto:juguang at tll.org.sg]
> > > 1. I am wondering if anyone has tried to load a large
> > > sequence (like a
> >
> > I have loaded the whole swissprot, embl and trembl dataset
> > into biosql
> > in mysql. It is not as fast as we expected but endurable. :-)
>
> Did you include the genome data from GenBank, where 1 record
> is a whole chromosome?
>
> It took 6 hrs to load a single chromosome (20Mb) on my laptop
> before anything happened on the disk. I guess most of the time
> was spent on creating objects.
I suspect it was spent on disk thrashing because you have too little
memory on the machine. Try it on a machine with 1GB or more.
-hilmar
>
> Thanks.
>
> -Xiaoying
>
> >
> > >
> > > 3. The problem I encountered may be related to how the taxon_name
> > > table is populated by load_seqdatabase.pl (or the modules it calls).
> > > I loaded the database with 2 organelle genomes, the mito and the
> > > chloroplast, with the following two records in that order. Though
> > > both records show up in the bioentry table, it seems only the info
> > > from the first record got populated into the taxon_name table:
> > >
> > >  taxon_id |                name                |   name_class
> > > ----------+------------------------------------+-----------------
> > >         1 | Eukaryota                          | scientific name
> > >         2 | Viridiplantae                      | scientific name
> > > .......... extra lines removed ...................
> > >        13 | Brassicaceae                       | scientific name
> > >        14 | Arabidopsis                        | scientific name
> > >        15 | Mitochondrion                      | scientific name
> > >        16 | Mitochondrion Arabidopsis          | scientific name
> > >        17 | Mitochondrion Arabidopsis thaliana | scientific name
> > >        17 | thale cress                        | common name
> > > (18 rows)
> >
> > To be honest, I do not care about it, as long as you can fetch the
> > result out correctly. I have actually met such a case before. One
> > way to solve it is to run load_ncbi_taxonomy before loading your
> > sequences. (That may be unnecessary in your case)
>
> This is something we have to be aware of before querying by
> 'name': a query of the type name='Arabidopsis thaliana' won't
> return anything. taxon_id (ncbi) may be safer to use.
> Preloading the taxonomy probably is the only way to deal
> with this if it is a large data set like the whole EMBL....
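
The name-lookup pitfall described above can be reproduced with a small
sketch. This is not BioSQL or BioPerl code: it uses Python's sqlite3 module
with a hypothetical in-memory table that has only the three columns shown
in the quoted listing, filled with rows copied from that listing (the
elided rows are left out). The helper ids_for_name is made up for
illustration.

```python
import sqlite3

# Hypothetical in-memory stand-in for taxon_name; only the three columns
# visible in the quoted listing are modelled, not the real BioSQL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)"
)

# Rows copied from the quoted listing (elided rows are omitted).
rows = [
    (14, "Arabidopsis", "scientific name"),
    (15, "Mitochondrion", "scientific name"),
    (16, "Mitochondrion Arabidopsis", "scientific name"),
    (17, "Mitochondrion Arabidopsis thaliana", "scientific name"),
    (17, "thale cress", "common name"),
]
conn.executemany("INSERT INTO taxon_name VALUES (?, ?, ?)", rows)


def ids_for_name(name):
    """Return the taxon_id values whose name matches exactly (illustrative helper)."""
    cur = conn.execute(
        "SELECT DISTINCT taxon_id FROM taxon_name WHERE name = ?", (name,)
    )
    return [row[0] for row in cur]


# The name a user would expect to work finds nothing ...
exact = ids_for_name("Arabidopsis thaliana")
# ... while the name as it was actually stored does match.
stored = ids_for_name("Mitochondrion Arabidopsis thaliana")
common = ids_for_name("thale cress")

print(exact, stored, common)
```

With these rows, the exact-match query for 'Arabidopsis thaliana' comes back
empty, while the stored scientific name and the common name both resolve to
the same taxon_id. That is why a lookup by ID is less fragile than a lookup
by name when the names were derived from the sequence records.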
>
>
> -Xiaoying
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>