[Biojava-l] Fastq benchmark

Michael Heuer heuermh at gmail.com
Tue Jan 24 18:00:58 UTC 2012


Hello Mic,

That is an interesting benchmark, and you could probably squeeze a bit
more performance out of fqextract.java by tweaking the data structures
(e.g. provide expected size to the HashMap constructor, use
ImmutableMap from Guava, etc.).
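
As a rough sketch of the first suggestion (not taken from fqextract.java; the
class and method names below are made up for illustration), presizing a
HashMap so it does not rehash while the ids are loaded might look like this.
It assumes the default load factor of 0.75:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PresizedMapSketch {

    // HashMap resizes once its size exceeds capacity * loadFactor (0.75 by
    // default), so request enough capacity up front for expectedSize entries.
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / 0.75f) + 1;
    }

    // Hypothetical helper: map each read id to its position in the input list.
    static Map<String, Integer> indexIds(List<String> ids) {
        Map<String, Integer> index = new HashMap<>(capacityFor(ids.size()));
        for (int i = 0; i < ids.size(); i++) {
            index.put(ids.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = indexIds(Arrays.asList("read1", "read2", "read3"));
        System.out.println(index.get("read2"));
    }
}
```

Whether this helps measurably depends on how many ids are loaded; it only
avoids the intermediate resizes, so it is worth timing before and after.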

Using bioperl, biopython, bioruby, or biojava for this task will be
much slower than just spitting out lines from a file since they are
all validating the FASTQ format against the specification.
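
For comparison, here is a minimal sketch of the "just spitting out lines"
approach. It is not the benchmark's code; the class and method names are
hypothetical. It assumes every record is exactly four lines and checks
nothing else, which is precisely the validation the library parsers do for
you:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Collections;
import java.util.Set;

final class RawFastqExtractSketch {

    // Copies records whose id is in `wanted` from `in` to `out` and returns
    // how many were written. Assumes strict four-line records; no validation.
    static int extract(BufferedReader in, Set<String> wanted, Writer out) throws IOException {
        int written = 0;
        String header;
        while ((header = in.readLine()) != null) {
            if (header.isEmpty()) {
                continue; // tolerate blank lines between records
            }
            String sequence = in.readLine();
            String separator = in.readLine();
            String quality = in.readLine();
            if (quality == null) {
                break; // truncated final record is silently dropped
            }
            // The id is the header text after '@', up to the first space.
            int space = header.indexOf(' ');
            String id = header.substring(1, space < 0 ? header.length() : space);
            if (wanted.contains(id)) {
                out.write(header + "\n" + sequence + "\n" + separator + "\n" + quality + "\n");
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        String fastq = "@read1 first\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n";
        StringWriter out = new StringWriter();
        extract(new BufferedReader(new StringReader(fastq)), Collections.singleton("read2"), out);
        System.out.print(out);
    }
}
```

This will misread files with wrapped sequence or quality lines, and it never
notices a malformed record, so it trades safety for speed.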

   michael


On Tue, Jan 24, 2012 at 6:08 AM, Scooter Willis <HWillis at scripps.edu> wrote:
> You can try a FASTA version of the file to measure performance gain.
>
> File file = new File("filename");
> boolean lazySequenceLoad = true;
>
> LinkedHashMap<String, DNASequence> sequences =
>     FastaReaderHelper.readFastaDNASequence(file, lazySequenceLoad);
>
> This will go through and index the accession ids without loading any
> sequence data, which keeps memory allocation low and loading fast. You can
> then reference a DNASequence by name, and when you need the sequence data
> it will use the file index to load it from the file for that specific
> sequence. The same approach can be applied to FASTQ files.
>
> Scooter
>
> On 1/24/12 3:37 AM, "Mic" <mictadlo at gmail.com> wrote:
>
>>Hello,
>>I have found the following benchmark (
>>http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file/11279#11279
>>)
>>and I just wonder whether it is possible to make the Java example even faster?
>>
>>Thank you in advance.



