[Bioperl-l] Bio::Index::Fastq - Interface for indexing (multiple) fastq files failure

Mon Apr 12 14:08:35 UTC 2010

(getting back on this thread, to make sure the response is logged)

Each ID has to be added to the database iteratively (there is even a way to customize that with a callback), so we could count entries there and warn once a maximum is exceeded.
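
To make the counting idea concrete, here is a minimal sketch. It is written in Python rather than Perl purely for illustration and is not Bio::Index::Fastq code. The names index_fastq, id_parser and MAX_ENTRIES_BEFORE_WARNING are hypothetical. The threshold is only a placeholder borrowed from the "about 4M hash keys" remark further down the thread. The sketch assumes strict four-line FASTQ records and any dict-like object as the offset store.

```python
import io
import warnings

# Hypothetical threshold; the thread does not settle on a value.
MAX_ENTRIES_BEFORE_WARNING = 4_000_000


def index_fastq(handle, store, max_entries=MAX_ENTRIES_BEFORE_WARNING,
                id_parser=None):
    """Record the byte offset of each FASTQ record in `store`, keyed by ID.

    `handle` must be opened in binary mode. Assumes strict four-line
    records (header, sequence, '+', quality). Returns the record count.
    """
    if id_parser is None:
        # Default (an assumption): the ID is the header text after '@',
        # up to the first whitespace. A custom callback can replace this.
        def id_parser(header):
            return header[1:].split()[0]

    count = 0
    warned = False
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        for _ in range(3):
            handle.readline()  # skip sequence, '+' line and quality line

        # IDs are added one at a time, so this is where we can count.
        store[id_parser(header.decode("ascii"))] = offset
        count += 1
        if count > max_entries and not warned:
            warnings.warn(
                "indexing a large fastq file (more than %d entries); "
                "a different index back-end may perform better" % max_entries
            )
            warned = True  # warn only once per run
    return count


if __name__ == "__main__":
    demo = b"@r1 desc\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"
    offsets = {}
    print(index_fastq(io.BytesIO(demo), offsets, max_entries=1), offsets)
```

Because the count is only known as entries are added, the warning fires partway through indexing rather than up front, which is the limitation Till points out below.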

I still think Jason has a point re: the short read aligners; we should investigate how they index those files.  It might not be feasible to replicate that completely within bioperl (why reinvent the wheel?), but we could set up Bio::DB::* modules to access them if we know how they are indexed.

chris

On Apr 7, 2010, at 8:25 AM, Till Bayer wrote:

> Hey Chris,
> 
> > This does sound like a very good solution, would just need a max value to trigger the warning.
> 
> A problem may be that the module does not know the number of fastq entries before it gets done indexing. Or does the index go into memory first and is dumped into the DB afterwards?
> If not, maybe the warning could be triggered by the fastq file size? That's hardly accurate, but better than nothing...
> 
> Cheers,
> 
> Till
> 
> On 4/7/2010 3:36 PM, Christopher Fields wrote:
>> Jason, Till,
>> 
>> Did you notice who the author was?  That would be our own Mark Jensen!  ;>
>> 
>> http://search.cpan.org/~majensen/SQLite_File-0.02/
>> 
>> This does sound like a very good solution, would just need a max value to trigger the warning.  I'll still need to look over the FASTQ index module to ensure it's indexing correctly (I think it does in most cases), but this should be much easier to implement.
>> 
>> chris
>> 
>> On Apr 7, 2010, at 1:34 AM, Jason Stajich wrote:
>> 
>>> Thanks Till!  That might solve the problem quite well and would be worth a benchmarking attempt to see what happens.
>>> 
>>> -jason
>>> Till Bayer wrote, On 4/6/10 11:21 PM:
>>>> Hey,
>>>> 
>>>> there is also SQLite_File, which has DB_File emulation and can be used with AnyDBM_File to just store the offsets. It adds another layer, but you could avoid another module or a required dependency, I guess.
>>>> Maybe there could be a warning like 'you are indexing a large fastq file, this would work better if DBD::SQLite was installed'.
>>>> 
>>>> Cheers,
>>>> 
>>>> Till
>>>> 
>>>> 
>>>> On 4/6/2010 11:28 PM, Jason Stajich wrote:
>>>>> I think SQLite is a good solution but I still found things a bit
>>>>> slow when I was storing all the data in the db, but if we are instead
>>>>> just indexing byte offsets in the file (which is what the current
>>>>> indexing is doing) maybe it will perform well enough.
>>>>> 
>>>>> One question on implementing this is do we want to have plug-in
>>>>> implementations to the Bio::Index:: classes (and Bio::DB::Fasta as well
>>>>> I would think) that can abstract the indexing method or just a new
>>>>> implementation as Bio::Index::FastqSQLite... Or we can just replace
>>>>> BDB/DB_File with SQLite and now have a new required dependency?
>>>>> 
>>>>> I'd want to also look at the solutions employed in some of the short
>>>>> read aligners if they do index the fastq files in any other way.
>>>>> 
>>>>> -jason
>>>>> Chris Fields wrote, On 4/6/10 12:47 PM:
>>>>>> No problem, it points to issues in the current implementation that need
>>>>>> addressing.
>>>>>> 
>>>>>> Jason, you thinking we just need to replace BDB with SQLite, or you
>>>>>> thinking something else?
>>>>>> 
>>>>>> chris
>>>>>> 
>>>>>> On Apr 6, 2010, at 2:38 PM, KOVALIC, DAVID K [AG/1000] wrote:
>>>>>> 
>>>>>>> Guys,
>>>>>>> 
>>>>>>> Thanks for the information; it is good to know what the problem is.
>>>>>>> 
>>>>>>> I am afraid I am not much of a programmer so I am not liable to be much
>>>>>>> help with any work switching out the back-end. I can however volunteer
>>>>>>> for testing purposes if this helps at all.
>>>>>>> 
>>>>>>> I think this is just a case of NGS data volumes having overtaken a
>>>>>>> previously adequate implementation.
>>>>>>> 
>>>>>>> David
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>>>>> Sent: Monday, April 05, 2010 6:57 PM
>>>>>>> To: Peter
>>>>>>> Cc: Jason Stajich; KOVALIC, DAVID K [AG/1000]; bioperl-l at bioperl.org
>>>>>>> Subject: Re: [Bioperl-l] Bio::Index::Fastq - Interface for indexing
>>>>>>> (multiple) fastq files failure
>>>>>>> 
>>>>>>> On Apr 5, 2010, at 6:15 PM, Peter wrote:
>>>>>>> 
>>>>>>>> On Mon, Apr 5, 2010 at 11:53 PM, Jason Stajich <jason at bioperl.org> wrote:
>>>>>>>>> Hi David - I am not sure this is going to be the right tool for the job.
>>>>>>>>> I'm concerned that none of the Bio::Index:: will really work for
>>>>>>>>> Illumina/NGS size data because once you get beyond about 4M hash
>>>>>>>>> keys things slow down quite dramatically and/or don't finish.
>>>>>>>>> 
>>>>>>>>> I think we have to consider SQLite implementations or some more
>>>>>>>>> explicit way to handle larger keysize for hashes in the DB_File or
>>>>>>>>> BerkeleyDB approach. A similar slow problem can be seen if you
>>>>>>>>> just index a fastq converted fasta file from a single Illumina lane.
>>>>>>>> Another example, and this was in Python rather than Perl, but
>>>>>>>> SQLite got a thumbs up over an in house hash based approach:
>>>>>>>> http://lists.idyll.org/pipermail/biology-in-python/2010-March/000511.html
>>>>>>>> I think a new SQLite based Bio* OBF successor to the existing
>>>>>>>> BDB based OBDA standard for indexing files could be very interesting.
>>>>>>>> 
>>>>>>>> Peter
>>>>>>> Would be nice to get some ideas performance-wise with some data sets.
>>>>>>> SQLite is a very easy option (I'm using it routinely as well).
>>>>>>> 
>>>>>>> chris
>>>>>>> 
>>>>>> 
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> 
> -- 
> Till Bayer
> 4700 King Abdullah University for Science and Technology
> Building 2, Room 4231-W16
> Thuwal 23955-6900
> Saudi Arabia
> Phone: +96628082373