[Bioperl-l] Bio::DB::Fasta and threads
Florent Angly
florent.angly at gmail.com
Mon Dec 3 02:36:28 UTC 2012
Hi all,
This is in response to Carson Holt's report that Bio::DB::Fasta does not
play well with threads: https://redmine.open-bio.org/issues/3397
The first issue is the serialization of objects that inherit from
Bio::DB::IndexedBase (e.g. Bio::DB::Fasta and Bio::DB::Qual), which is
needed for threading (for example when using Thread::Queue::Any). I
implemented hooks that make serialization with Storable's freeze() and
thaw() transparent.
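To illustrate what such hooks look like, here is a minimal sketch using
Storable's STORABLE_freeze / STORABLE_thaw hook mechanism. The class
My::IndexedDB, its 'path' attribute and its _open() method are made up
for the example; this is not the actual code from the branch. The
assumed idea is to serialize only what is needed to reopen the database
and to rebuild the non-serializable handle on thaw.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical stand-in for an indexed database class (not BioPerl code).
package My::IndexedDB;

sub new {
    my ($class, %args) = @_;
    my $self = bless { path => $args{path} }, $class;
    $self->_open;
    return $self;
}

# Simulate a resource that Storable cannot serialize by default
# (a code reference here; a tied DBM hash in a real database).
sub _open {
    my ($self) = @_;
    my $path = $self->{path};
    $self->{handle} = sub { return "handle for $path" };
    return;
}

# Called by Storable instead of walking the object: keep only the path.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return $self->{path};
}

# Called by Storable on an empty blessed hash: restore the path, reopen.
sub STORABLE_thaw {
    my ($self, $cloning, $serialized) = @_;
    $self->{path} = $serialized;
    $self->_open;
    return;
}

package main;

my $db   = My::IndexedDB->new(path => 'seqs.fa');
my $copy = thaw(freeze($db));    # both hooks run here
print $copy->{handle}->(), "\n";
```

Without the hooks, freeze() would die on the code reference; with them,
the copy comes back with its own freshly opened handle.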
Another issue was the lack of communication between different
Bio::DB::IndexedBase instances, which means that an instance could
easily be writing or deleting the database that another instance is
working on. To fix this, I needed some form of locking.
Some Bio::DB::IndexedBase database backends (DB_File) have some support
for locking but Bio::DB::IndexedBase also supports other database
backends for which there is no native locking mechanism. So, I had to
come up with a more general solution: a lock file. I noticed that
Bio::DB::SeqFeature::Store::berkeleydb has a locking mechanism, based on
flock(), which means that it does not work with NFS-mounted filesystems.
All the Bioperl-based scripts I (and most likely many others) write run
on servers that use NFS, so this support is important. I have found only
one way to do the NFS locking safely, using File::SharedNFSLock. It has
a few downsides though:
1/ it is an external dependency,
2/ it does not work on FAT filesystems (which should be mostly restricted
to USB sticks nowadays), where the lock is never acquired, and
3/ at the moment, it requires a patch to work in threaded context
(https://rt.cpan.org/Public/Bug/Display.html?id=81597)
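For readers unfamiliar with lock files that avoid flock(), here is a
rough sketch of the generic hard-link technique often used for locking
on NFS. It uses only core modules and is not the code of
File::SharedNFSLock or of the branch; the function names try_lock() and
release_lock() and the file names are made up for the example. Each
contender creates a uniquely named file, tries to hard-link it to the
shared lock name, and then checks the link count of its own file rather
than trusting the return value of link().

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Sys::Hostname qw(hostname);

# Try once to take the lock. Returns the name of our unique file on
# success (needed to release the lock later), or undef on failure.
sub try_lock {
    my ($lock_file) = @_;
    my $unique = join '.', $lock_file, hostname(), $$, int rand 1e9;
    open my $fh, '>', $unique or die "Cannot create $unique: $!";
    close $fh;
    # The return value of link() is ignored on purpose; the link count
    # of our own file tells us whether the link was really created.
    link $unique, $lock_file;
    my $nlink = (stat $unique)[3];
    return $unique if defined $nlink && $nlink == 2;
    unlink $unique;
    return;
}

# Release the lock by removing both names.
sub release_lock {
    my ($lock_file, $unique) = @_;
    unlink $lock_file;
    unlink $unique;
    return;
}

# Usage: wait until the lock is free, do the work, release.
my $dir       = tempdir(CLEANUP => 1);
my $lock_file = "$dir/index.lock";
my $token;
until (defined($token = try_lock($lock_file))) {
    sleep 1;    # another instance is presumably indexing
}
print "lock acquired\n";
release_lock($lock_file, $token);
```

A real implementation would also need timeouts and a way to detect
stale locks left behind by crashed processes, which this sketch omits.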
Note that while I have now added basic support for threads in
Bio::DB::IndexedBase, I still get segfaults in specific cases,
for example when returning a database or sequence object. This might be
related to this issue:
https://rt.perl.org/rt3/Ticket/Display.html?id=115972. Beyond this, the
new code seems to work nicely. See the branch
https://github.com/bioperl/bioperl-live/tree/storable_db if you want to
test yourself. For example, one can now run multiple threads, each of
them creating a Bio::DB::Fasta database from the same FASTA file: the
first thread performs the indexing while the others wait for the
indexing to finish before querying the database.
Comments welcome. Regards,
Florent