[Biojava-l] Genbank feature parsing performance

Fri Jun 17 10:12:05 UTC 2011

Hi,

I have had the same issue when parsing large sets of GenBank files. In
my case, the workaround was to first treat the whole GenBank record as
a string and do a quick regex match to check whether it contained
something of interest (in my case, I was searching for specific
taxids):

    // First do a quick pattern match to extract the taxid, so we can
    // exit early without the overhead of parsing the whole file.
    private final Pattern taxidPattern = Pattern.compile("db_xref=\\\"taxon:(\\d+)");

    Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
    if (taxidMatcher.find()) {
        def taxid = taxidMatcher[0][1].toInteger()
        if (!taxidList.contains(taxid)) {
            return
        }
    }
    // here do the slow part of actually parsing all the features

This is in Groovy, so there are a few syntactic differences from Java. If you
are only interested in a subset of the GenBank records, then this
approach might be of use.
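
For anyone who wants the same idea in plain Java, here is a rough,
self-contained sketch. It is not my production code: the class and
method names (TaxidPrefilter, splitRecords, extractTaxid,
selectRecords) are made up for illustration, and it assumes each
record in the flat file ends with a line containing only "//". Like
the Groovy snippet, it keeps records that have no taxon qualifier at
all, and it makes no BioJava calls; the records it returns would then
be handed to whatever parser you already use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxidPrefilter {
    // Matches a qualifier such as /db_xref="taxon:9606" and captures the digits.
    private static final Pattern TAXID = Pattern.compile("db_xref=\"taxon:(\\d+)");

    // Split a multi-record flat file into record strings.
    // Assumption: every record ends with a line containing only "//".
    static List<String> splitRecords(String flatFile) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : flatFile.split("\\R")) {
            current.append(line).append('\n');
            if (line.trim().equals("//")) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep a trailing record that lacks the terminator line.
        if (!current.toString().trim().isEmpty()) {
            records.add(current.toString());
        }
        return records;
    }

    // Cheap check: find the first taxon id in the raw text, if there is one.
    static OptionalInt extractTaxid(String record) {
        Matcher m = TAXID.matcher(record);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1)))
                        : OptionalInt.empty();
    }

    // Return only the records worth the cost of full feature parsing.
    static List<String> selectRecords(String flatFile, Set<Integer> wantedTaxids) {
        List<String> kept = new ArrayList<>();
        for (String record : splitRecords(flatFile)) {
            OptionalInt taxid = extractTaxid(record);
            // Records without a taxon qualifier are kept, as in the Groovy version.
            if (!taxid.isPresent() || wantedTaxids.contains(taxid.getAsInt())) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two tiny made-up records, just enough to exercise the filter.
        String flatFile =
            "LOCUS       REC1\n" +
            "     source          1..10\n" +
            "                     /db_xref=\"taxon:9606\"\n" +
            "//\n" +
            "LOCUS       REC2\n" +
            "     source          1..10\n" +
            "                     /db_xref=\"taxon:10090\"\n" +
            "//\n";
        List<String> kept = selectRecords(flatFile, Collections.singleton(9606));
        System.out.println(kept.size() + " record(s) left for full parsing");
    }
}
```

The regex only looks at the first taxon qualifier in each record, so
a record with several source features would need a loop over
m.find() instead.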

M

On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
> Hi,
>
> I am developing an app where features are extracted from a large GenBank file and then processed (multiple alignment, annotation, ...).
>
> The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.
>
> Feature extraction is done via:
>
> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
> FeatureHolder fh = richSequence.filter(ff);
> Feature feat = fh.features().next();
> ...
>
> Any suggestions on how to improve the performance of feature extraction are welcome.
>
> Thanks,
>
> khalil