[Biojava-l] Genbank feature parsing performance
Khalil El Mazouari
khalil.elmazouari at gmail.com
Fri Jun 17 10:33:28 UTC 2011
Thanks Martin,
I already tried the regex approach. The performance gain was less than 10%.
My situation differs in two respects:
1. The information to extract from the GenBank file is always present.
2. There are multiple features to extract from each record.
I agree with you: extracting a single field from a GenBank file is much faster with a simple regex than with FeatureFilter.
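
For reference, here is a minimal sketch of what a single-pass regex scan for several qualifiers could look like in plain Java. It is not BioJava code and not my actual implementation: the class name QualifierScan, the method scan() and the qualifier keys (gene, product, db_xref) are assumptions chosen for illustration only. It treats the record as a string and collects all matching /key="value" pairs in one pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QualifierScan {
    // Hypothetical key set; matches /key="value" where the value contains no
    // double quote (it may still wrap over several lines).
    private static final Pattern QUALIFIER =
        Pattern.compile("/(gene|product|db_xref)=\"([^\"]*)\"");

    // Scans the raw record text once and groups qualifier values by key.
    static Map<String, List<String>> scan(String record) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        Matcher m = QUALIFIER.matcher(record);
        while (m.find()) {
            // Collapse line wraps and indentation inside a value to one space.
            String value = m.group(2).replaceAll("\\s+", " ");
            found.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(value);
        }
        return found;
    }

    public static void main(String[] args) {
        String record =
              "     CDS             1..300\n"
            + "                     /gene=\"abc\"\n"
            + "                     /product=\"hypothetical\n"
            + "                     protein\"\n"
            + "                     /db_xref=\"taxon:9606\"\n";
        System.out.println(scan(record));
    }
}
```

Note that this only collects qualifier values; it does not associate them with feature locations, so it is not a replacement for a real feature parse when the feature itself is needed.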
Regards,
khalil
On 17 Jun 2011, at 12:12, Martin Jones wrote:
> Hi,
>
> I have had the same issue when parsing large sets of genbank files. In
> my case, the workaround was to first treat the whole genbank record as
> a string, and do a quick regex match to check if it contained
> something of interest (in my case I was searching for specific
> taxids):
>
> import java.util.regex.Matcher
> import java.util.regex.Pattern
>
> // first do a quick pattern-match to extract the taxid so we can
> // exit early without the overhead of parsing the whole file
> private final Pattern taxidPattern =
>     Pattern.compile("db_xref=\\\"taxon:(\\d+)");
> Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
> if (taxidMatcher.find()) {
>     // group 1 of the first match holds the numeric taxid
>     def taxid = taxidMatcher[0][1].toInteger()
>     if (!taxidList.contains(taxid)) {
>         return
>     }
> }
> // here do the slow part of actually parsing all the features
>
>
> This is in Groovy so there are a few syntactical differences. If you
> are only interested in a subset of the GenBank records, then this
> approach might be of use.
>
> M
>
>
>
>
> On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>> Hi,
>>
>> I am developing an app in which features are extracted from a large GenBank file and then processed (multiple alignment, annotation, ...).
>>
>> The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.
>>
>> Feature extraction is done via:
>>
>> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
>> FeatureHolder fh = richSequence.filter(ff);
>> Feature feat = fh.features().next();
>> ...
>>
>> Any suggestions on how to improve the performance of feature extraction are welcome.
>>
>> Thanks,
>>
>> khalil