[Biojava-l] Genbank feature parsing performance

Fri Jun 17 10:33:28 UTC 2011

Thanks Martin,

I already tried the regex approach. The performance gain was less than 10%.

My situation differs in two respects:
1. The information to extract from the GenBank file is always present, so no record can be skipped early.
2. There are multiple features to extract from each record.

I agree with you: extracting a single field from a GenBank file is much faster with a simple regex than with FeatureFilter.
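
For reference, here is a minimal plain-Java sketch of the regex pre-filter described in the quoted message below, translated from the Groovy snippet. The class and method names (TaxidPrefilter, extractTaxid, shouldParse) are made up for illustration and are not part of BioJava. It assumes the whole record is already available as a String, and it lets records without a taxon qualifier through to the full parser, which is one reading of the original snippet.

```java
import java.util.OptionalInt;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxidPrefilter {
    // Matches the qualifier /db_xref="taxon:NNNN" and captures the digits.
    private static final Pattern TAXID = Pattern.compile("db_xref=\"taxon:(\\d+)");

    /** Returns the first taxon id found in the raw record text, if any. */
    public static OptionalInt extractTaxid(String record) {
        Matcher m = TAXID.matcher(record);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    /**
     * True when the record should be handed to the slow full parser:
     * either no taxid was found, or the taxid is one we are interested in.
     */
    public static boolean shouldParse(String record, Set<Integer> wantedTaxids) {
        OptionalInt taxid = extractTaxid(record);
        return !taxid.isPresent() || wantedTaxids.contains(taxid.getAsInt());
    }

    public static void main(String[] args) {
        // Hypothetical fragment of a GenBank source feature.
        String record = "     source          1..1000\n"
                + "                     /organism=\"Homo sapiens\"\n"
                + "                     /db_xref=\"taxon:9606\"\n";
        System.out.println(shouldParse(record, Set.of(9606)));
        System.out.println(shouldParse(record, Set.of(10090)));
    }
}
```

This only pays off when many records can be rejected before parsing, which is why it helps little in my case.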

Regards,

khalil

On 17 Jun 2011, at 12:12, Martin Jones wrote:

> Hi,
> 
> I have had the same issue when parsing large sets of genbank files. In
> my case, the workaround was to first treat the whole genbank record as
> a string, and do a quick regex match to check if it contained
> something of interest (in my case I was searching for specific
> taxids):
> 
>    // First do a quick pattern match to extract the taxid so we can
>    // exit early without the overhead of parsing the whole file.
>    private final Pattern taxidPattern = Pattern.compile("db_xref=\\\"taxon:(\\d+)");
>    Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
>    if (taxidMatcher.find()) {
>        def taxid = taxidMatcher[0][1].toInteger()
>        if (!taxidList.contains(taxid)) {
>            return
>        }
>    }
>    // here do the slow part of actually parsing all the features
> 
> 
> This is in Groovy so there are a few syntactical differences. If you
> are only interested in a subset of the GenBank records, then this
> approach might be of use.
> 
> M
> 
> 
> 
> 
> On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>> Hi,
>> 
>> I am developing an app in which features are extracted from a large GenBank file and then processed (multiple alignment, annotation, ...).
>> 
>> The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.
>> 
>> Feature extraction is done via:
>> 
>> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
>> FeatureHolder fh = richSequence.filter(ff);
>> Feature feat = fh.features().next();
>> ...
>> 
>> Any suggestions on how to improve the performance of feature extraction are welcome.
>> 
>> Thanks,
>> 
>> khalil