[Biojava-l] Genbank feature parsing performance

Fri Jun 17 16:21:43 UTC 2011

Hi,

Execution time for parsing GenBank, EMBL and EMBL-XML is roughly the same.

However, writing sequences in EMBL format was 87% slower than writing them in GenBank format.

Regards,

khalil

On 17 Jun 2011, at 12:36, Martin Jones wrote:

> Yes, this approach won't be much use if you are interested in the
> contents of every genbank record.
> 
> Have you thought about parsing the gb files in parallel? In my
> experience, parsing genbank files scales quite nicely when done in
> multiple threads. I have used the GPars library for this type of job
> and it is very nice to use:
> 
> http://gpars.codehaus.org/Parallelizer
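
For readers without GPars, roughly the same idea can be sketched in plain Java. This is only a sketch under some assumptions: the class and method names are made up for illustration, `parseOne` stands in for whatever per-record parsing you already do, and that parsing has to be safe to call from several threads at once (for example by creating a fresh reader per record). The only GenBank fact relied on is that each record ends with a line containing just "//".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of BioJava or GPars.
final class ParallelRecords {

    /** Splits a multi-record GenBank flat file on the "//" terminator line. */
    static List<String> splitRecords(String flatFile) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : flatFile.split("\\R")) {
            current.append(line).append('\n');
            if (line.equals("//")) {              // end-of-record marker
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep an unterminated trailing record rather than silently dropping it.
        if (current.toString().trim().length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    /** Applies parseOne to every record on the common fork/join pool, keeping input order. */
    static <T> List<T> parseAll(List<String> records, Function<String, T> parseOne) {
        return records.parallelStream().map(parseOne).collect(Collectors.toList());
    }
}
```

Usage would look something like `ParallelRecords.parseAll(ParallelRecords.splitRecords(text), rec -> myParse(rec))`, where `myParse` is your own function. Reading the whole file into one string is a simplification; for very large files the splitting would need to be streamed.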
> 
> 
> M
> 
> 
> 
> On 17 June 2011 11:33, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>> Thanks Martin,
>> 
>> I already tried the regex. The performance increase was < 10%.
>> 
>> My situation differs in two ways:
>> 1. The info to extract from the genbank file is always present.
>> 2. There are multiple features to extract from each record.
>> 
>> I agree with you. Extracting a single field from a genbank file is much faster with a simple regex than with FeatureFilter.
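
When several qualifiers are wanted from each record, one regex scan can collect all of them instead of running one search per field. The following is a rough sketch, not a GenBank parser: the class name is invented, it assumes qualifiers of the form /key="value", it does not handle doubled quotes inside a value, and it joins wrapped lines with a space (which would be wrong for something like /translation, where wrapped lines should be joined without one).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class QualifierScan {
    // Captures the qualifier name and its quoted value, e.g. /gene="abc".
    private static final Pattern QUALIFIER = Pattern.compile("/(\\w+)=\"([^\"]*)\"");

    /** Collects the values of all wanted qualifiers in a single pass over the raw record. */
    static Map<String, List<String>> scan(String rawRecord, Set<String> wantedKeys) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        Matcher m = QUALIFIER.matcher(rawRecord);
        while (m.find()) {
            String key = m.group(1);
            if (wantedKeys.contains(key)) {
                // Collapse a line wrap (and its surrounding indentation) into one space.
                String value = m.group(2).replaceAll("\\s*\\R\\s*", " ");
                found.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }
        }
        return found;
    }
}
```

Note that this loses the link between a qualifier and the feature it belongs to, so it only fits cases where the values alone are enough.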
>> 
>> Regards,
>> 
>> khalil
>> 
>> On 17 Jun 2011, at 12:12, Martin Jones wrote:
>> 
>>> Hi,
>>> 
>>> I have had the same issue when parsing large sets of genbank files. In
>>> my case, the workaround was to first treat the whole genbank record as
>>> a string, and do a quick regex match to check if it contained
>>> something of interest (in my case I was searching for specific
>>> taxids):
>>> 
>>>    // First do a quick pattern match to extract the taxid, so we can
>>>    // exit early without the overhead of parsing the whole file.
>>>    private final Pattern taxidPattern = Pattern.compile("db_xref=\\\"taxon:(\\d+)");
>>>    Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
>>>    if (taxidMatcher.find()) {
>>>        def taxid = taxidMatcher[0][1].toInteger()
>>>        if (!taxidList.contains(taxid)) {
>>>            return
>>>        }
>>>    }
>>>    // here do the slow part of actually parsing all the features
>>> 
>>> 
>>> This is in Groovy so there are a few syntactical differences. If you
>>> are only interested in a subset of the GenBank records, then this
>>> approach might be of use.
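
For anyone working in plain Java rather than Groovy, the same pre-filter might look roughly like this. It is a sketch with invented names (`TaxidPrefilter`, `shouldParse`), and it follows the quoted snippet in that only a taxid outside the wanted set causes a record to be skipped; records with no taxon qualifier at all still go to the full parser.

```java
import java.util.OptionalInt;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TaxidPrefilter {
    // Matches a qualifier such as /db_xref="taxon:9606" and captures the digits.
    private static final Pattern TAXID = Pattern.compile("db_xref=\"taxon:(\\d+)");

    /** Returns the first taxon id found in the raw record text, if any. */
    static OptionalInt extractTaxid(String rawRecord) {
        Matcher m = TAXID.matcher(rawRecord);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    /** True if the record should be handed to the full (slow) parser. */
    static boolean shouldParse(String rawRecord, Set<Integer> wantedTaxids) {
        OptionalInt taxid = extractTaxid(rawRecord);
        // No taxon qualifier found: parse anyway instead of guessing.
        return !taxid.isPresent() || wantedTaxids.contains(taxid.getAsInt());
    }
}
```

The pattern is compiled once as a static constant, so the per-record cost is a single scan of the record text.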
>>> 
>>> M
>>> 
>>> 
>>> 
>>> 
>>> On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I am developing an app where features are extracted from a large genbank file and then processed (multiple alignment, annotation, ...).
>>>> 
>>>> The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.
>>>> 
>>>> Feature extraction is done via:
>>>> 
>>>> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
>>>> FeatureHolder fh = richSequence.filter(ff);
>>>> Feature feat = fh.features().next();
>>>> ...
>>>> 
>>>> Any suggestions on how to improve the performance of feature extraction are welcome.
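
One option, if the same sequence is queried many times with different key/value pairs, is to walk its features once and build a lookup table, so later queries become map lookups. The sketch below uses a generic stand-in type and a plain `Map<String, String>` for the annotation instead of BioJava's `Feature` and `Annotation`, so all names are hypothetical. It only pays off under the assumption that each `filter` call scans the whole feature list; if most of the time is actually spent parsing the file, it will not help.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index; F stands in for whatever feature type is being used.
final class FeatureIndex<F> {
    // key -> value -> features carrying that (key, value) pair
    private final Map<String, Map<String, List<F>>> byKeyValue = new HashMap<>();

    /** Registers one feature under every (key, value) pair of its annotation. */
    void add(F feature, Map<String, String> annotation) {
        for (Map.Entry<String, String> e : annotation.entrySet()) {
            byKeyValue.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                      .computeIfAbsent(e.getValue(), v -> new ArrayList<>())
                      .add(feature);
        }
    }

    /** Returns the matching features, or an empty list when nothing matches. */
    List<F> lookup(String key, String value) {
        return byKeyValue.getOrDefault(key, Collections.emptyMap())
                         .getOrDefault(value, Collections.emptyList());
    }
}
```

Returning an empty list for a miss also sidesteps the `NoSuchElementException` that calling `next()` on an empty iterator would throw.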
>>>> 
>>>> Thanks,
>>>> 
>>>> khalil