[BioRuby] GSoC weekly status report No.5

Marjan Povolni marian.povolny at gmail.com
Mon Jun 25 20:38:10 UTC 2012


http://blog.mpthecoder.com/post/25870737554/gsoc-weekly-status-report-no-5

*Summary of the last week*

During the last week a few improvements have been made:


   - the validation messages have been improved with file names and line
   numbers, in the compiler error style,
   - filtering has been added,
   - replacing escaped characters has been re-implemented to get a huge
   performance improvement. The 1GB file that required 10min for parsing
   because of 6.5 million escaped characters is now parsed in 22.5 seconds,
   only 0.5 seconds more than when replacement is turned off,
   - added a tool for correctly counting features in a GFF3 file. This will
   be useful because the user can then find a good value for the feature cache
   size by using this tool to get the correct count and the benchmark tool to
   get the count for a particular cache size. The tool is still slow for some
   files, so I’m thinking about how to improve that,
   - other small fixes, comments and similar…
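
For readers unfamiliar with the escaping mentioned above: GFF3 encodes
reserved characters as percent escapes, for example %3B for a semicolon.
The following is only a Ruby sketch of that replacement step, with a fast
path for strings that contain no escapes. It is not the D code from the
parser (the re-implementation itself is not described in this report), and
unescape_gff3 is a hypothetical name.

```ruby
# Hypothetical helper, for illustration only; not the project's D code.
# Replaces percent escapes such as "%3B" with the character they encode.
# Assumes the escaped values are ASCII.
def unescape_gff3(text)
  # Fast path: a string without "%" has nothing to replace, so it is
  # returned as-is without allocating a new string.
  return text unless text.include?("%")

  # Replace each "%XX" (two hex digits) with the corresponding character.
  text.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }
end

puts unescape_gff3("Note=a%3Bb%3Dc")
```

Most fields contain no escapes at all, so skipping the substitution for
them is one plausible reason why the cost of unescaping can be kept small.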



*More on filtering*

The filtering was first implemented using classes, but later refactored
using delegates instead. The result was 50 fewer lines of code.

The user can now specify a filter before parsing a file like this:

GFF3File.parse_by_records("file.gff3", NO_VALIDATION, false,
                          NO_BEFORE_FILTER,
                          OR(ATTRIBUTE("ID", EQUALS("1")),
                             ATTRIBUTE("ID", CONTAINS("2"))));

The first filter, set to NO_BEFORE_FILTER in this example, is applied
before a line is parsed, which means that it doesn’t support the ATTRIBUTE
and FIELD predicates.

The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS,
CONTAINS, STARTS_WITH, AND, OR, NOT. In case they’re used in a way which is
not allowed, there will be a compiler error. Otherwise the allowed
combinations should be logical enough to guess (but I’ll document them too).
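
To illustrate how such predicates compose, here is a rough Ruby analogue
built from lambdas, which play the role delegates play in D. It is a sketch
under assumptions, not the library's API: a record is modelled as a Hash
with an :attributes Hash, and AND, OR and NOT are renamed all_of, any_of
and negate because the originals are reserved words in Ruby. Unlike the D
version, nothing here is checked at compile time.

```ruby
# Hypothetical Ruby analogue of the predicates; not the D library's API.

# String predicates: each returns a lambda that tests one string value.
def equals(expected)
  ->(value) { value == expected }
end

def contains(part)
  ->(value) { value.include?(part) }
end

def starts_with(prefix)
  ->(value) { value.start_with?(prefix) }
end

# Record predicates: look up a value in the record, then apply a string
# predicate to it. A missing field or attribute never matches.
def field(name, pred)
  ->(record) { v = record[name]; !v.nil? && pred.call(v) }
end

def attribute(name, pred)
  ->(record) { v = record[:attributes][name]; !v.nil? && pred.call(v) }
end

# Combinators over predicates of the same kind.
def any_of(*preds)
  ->(x) { preds.any? { |p| p.call(x) } }
end

def all_of(*preds)
  ->(x) { preds.all? { |p| p.call(x) } }
end

def negate(pred)
  ->(x) { !pred.call(x) }
end

# The filter from the example above, applied to made-up records.
filter = any_of(attribute("ID", equals("1")),
                attribute("ID", contains("2")))
records = [
  { seqid: "chr1", attributes: { "ID" => "1" } },
  { seqid: "chr1", attributes: { "ID" => "gene2" } },
  { seqid: "chr2", attributes: { "ID" => "3" } }
]
kept = records.select(&filter)
puts kept.map { |r| r[:attributes]["ID"] }.inspect
```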

I altered the benchmark tool a few times to test the performance, and the
results were very positive: in the few tests I did, the performance impact
of filtering was very small. I’ll have more data once the next tool is
finished.


*New week*

Release early and often - it’s a mantra I’ve heard quite a few times before.
So, as the group of mentors and students has agreed, every student will be
releasing a gem at the end of this week.

I’m still not sure what will be in it, because the support for shared
libraries in D compilers for Linux has not been implemented yet. So it will
probably be a combination of a command-line utility and a Ruby module which
uses that utility.

What I have currently in mind is re-implementing the gff3-fetch utility
developed by Pjotr in Ruby, to make it faster using D. But first I’ll
implement filtering functionality for it, so the users can reduce a file to
records which are interesting to them and then parse that using a parser in
Ruby, for example.

A Ruby module that would make using this utility easier for Ruby developers
seems like a good idea for the first release.
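
As a rough idea of what such a module could look like, here is a minimal
sketch. Everything named in it is an assumption made for illustration: the
module name GFF3Filter, the executable name gff3-filter and the --filter
option do not exist yet, and the real interface may end up quite different.
The sketch runs the utility, reads its GFF3 output and turns each record
line into a Hash.

```ruby
require "open3"

# Hypothetical wrapper module; all names here are placeholders.
module GFF3Filter
  TOOL = "gff3-filter" # assumed executable name, not a released tool

  # Split one GFF3 record line into its nine tab-separated columns and
  # return the commonly used ones as a Hash.
  def self.parse_line(line)
    cols = line.chomp.split("\t", 9)
    # Column 9 holds "key=value" pairs separated by ";"; entries without
    # "=" (such as a lone ".") are skipped.
    attrs = cols[8].to_s.split(";")
                   .select { |pair| pair.include?("=") }
                   .map { |pair| pair.split("=", 2) }
                   .to_h
    { seqid: cols[0], source: cols[1], type: cols[2],
      start: cols[3].to_i, end: cols[4].to_i, attributes: attrs }
  end

  # Run the utility on a file and return the surviving records.
  # The "--filter" option is an assumption, not a documented flag.
  def self.filter(path, expression)
    out, status = Open3.capture2(TOOL, "--filter", expression, path)
    raise "#{TOOL} failed" unless status.success?

    out.each_line
       .reject { |l| l.start_with?("#") || l.strip.empty? }
       .map { |l| parse_line(l) }
  end
end
```

Keeping the line parsing separate from the process call means the parsing
can be tested without the utility being installed.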

Part of this utility will be support for GFF3 output, so that will be
implemented too (and has already been done today to some extent).



