[Biojava-l] Genbank feature parsing performance
Martin Jones
martin.jones at ed.ac.uk
Fri Jun 17 10:12:05 UTC 2011
Hi,
I have had the same issue when parsing large sets of GenBank files. In
my case, the workaround was to first treat the whole GenBank record as
a string and do a quick regex match to check whether it contained
something of interest (in my case I was searching for specific
taxids):
import java.util.regex.Matcher
import java.util.regex.Pattern

// First do a quick pattern match to extract the taxid, so we can
// exit early without the overhead of parsing the whole record.
private final Pattern taxidPattern =
    Pattern.compile("db_xref=\\\"taxon:(\\d+)");

Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
if (taxidMatcher.find()) {
    // Groovy indexing: [0] is the first match, [1] its first capture group
    def taxid = taxidMatcher[0][1].toInteger()
    if (!taxidList.contains(taxid)) {
        return
    }
}
// here do the slow part of actually parsing all the features
This is in Groovy, so there are a few syntactic differences from Java.
If you are only interested in a subset of the GenBank records, then
this approach might be of use.
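
For anyone working in plain Java rather than Groovy, the same pre-filter
might look roughly like the sketch below. The class and method names
(TaxidPrefilter, extractTaxid, shouldParse) are made up for illustration,
and it assumes each record is already available as a single String and
that taxids fit in an int. Records with no taxon cross-reference are
passed on to the full parse, as in the snippet above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: cheap regex check before the expensive feature parse.
final class TaxidPrefilter {

    // Matches the taxon cross-reference of a source feature,
    // for example: /db_xref="taxon:9606"
    private static final Pattern TAXID_PATTERN =
            Pattern.compile("db_xref=\"taxon:(\\d+)");

    // Returns the first taxid found in the raw record text, if any.
    public static Optional<Integer> extractTaxid(String rawRecord) {
        Matcher m = TAXID_PATTERN.matcher(rawRecord);
        if (m.find()) {
            return Optional.of(Integer.parseInt(m.group(1)));
        }
        return Optional.empty();
    }

    // True if the record should go through the full (slow) parse.
    // A record is skipped only when a taxid is found and is not wanted.
    public static boolean shouldParse(String rawRecord, Set<Integer> wantedTaxids) {
        Optional<Integer> taxid = extractTaxid(rawRecord);
        return !taxid.isPresent() || wantedTaxids.contains(taxid.get());
    }

    public static void main(String[] args) {
        Set<Integer> wanted = new HashSet<>(Arrays.asList(9606, 10090));
        String record = "     source          1..1000\n"
                + "                     /organism=\"Homo sapiens\"\n"
                + "                     /db_xref=\"taxon:9606\"\n";
        if (shouldParse(record, wanted)) {
            // Only here would the record be handed to the real parser.
            System.out.println("parse this record");
        } else {
            System.out.println("skip this record");
        }
    }
}
```

The design choice is simply that a single regex scan over the raw text
is far cheaper than building all the feature objects, so the expensive
parser only ever sees the records that pass the check.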
M
On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
> Hi,
>
> I am developing an app where features are extracted from a large GenBank file and then processed (multiple alignment, annotation, ...).
>
> The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.
>
> Feature extraction is done via:
>
> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
> FeatureHolder fh = richSequence.filter(ff);
> Feature feat = fh.features().next();
> ...
>
> Any suggestions on how to improve the performance of feature extraction are welcome.
>
> Thanks,
>
> khalil