[Biojava-l] Genbank feature parsing performance

Mark Fortner phidias51 at gmail.com
Fri Jun 17 16:58:08 UTC 2011


Hi Khalil,
Did you try the genbank xml format?

Mark


On Fri, Jun 17, 2011 at 9:21 AM, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:

> Hi,
>
> Execution time for parsing GenBank, EMBL and EMBL-XML is roughly the same.
>
> However, writing sequences in EMBL format was 87% slower than in GenBank format.
>
> Regards,
>
> khalil
>
>
> On 17 Jun 2011, at 12:36, Martin Jones wrote:
>
> > Yes, this approach won't be much use if you are interested in the
> > contents of every genbank record.
> >
> > Have you thought about parsing the gb files in parallel? In my
> > experience, parsing genbank files scales quite nicely when done in
> > multiple threads. I have used the GPars library for this type of job
> > and it is very nice to use:
> >
> > http://gpars.codehaus.org/Parallelizer
> >
> >
> > M
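
For readers working in plain Java rather than Groovy, the parallel approach suggested above can be sketched with the standard library's parallel streams in place of GPars. This is a minimal sketch under stated assumptions: records are split on the GenBank `//` terminator line, and `parseOne` is a hypothetical placeholder for whatever per-record parsing you do. It assumes the per-record work is thread-safe and independent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelRecordParsing {

    /**
     * Splits multi-record GenBank text into individual records.
     * A record is taken to end at a line containing only "//".
     * Any text after the last terminator is ignored.
     */
    public static List<String> splitRecords(String genbankText) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : genbankText.split("\n")) {
            current.append(line).append('\n');
            if (line.trim().equals("//")) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        return records;
    }

    /**
     * Applies parseOne (hypothetical: your own per-record parser) to every
     * record using a parallel stream. Results keep the input order.
     */
    public static <T> List<T> parseAll(List<String> records, Function<String, T> parseOne) {
        return records.parallelStream().map(parseOne).collect(Collectors.toList());
    }
}
```

Whether this scales as well as reported above depends on how much of the run time is parsing rather than disk I/O, so it is worth measuring on your own data.
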
> >
> >
> >
> > On 17 June 2011 11:33, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
> >> Thanks Martin,
> >>
> >> I already tried the regex. The performance increase was < 10%.
> >>
> >> My situation differs in two respects:
> >> 1. The information to extract from the genbank file is always present.
> >> 2. There are multiple features to extract from each record.
> >>
> >> I agree with you. Extracting a single field from a genbank file is
> >> much faster with a simple regex than with FeatureFilter.
> >>
> >> Regards,
> >>
> >> khalil
> >>
> >> On 17 Jun 2011, at 12:12, Martin Jones wrote:
> >>
> >>> Hi,
> >>>
> >>> I have had the same issue when parsing large sets of genbank files. In
> >>> my case, the workaround was to first treat the whole genbank record as
> >>> a string, and do a quick regex match to check if it contained
> >>> something of interest (in my case I was searching for specific
> >>> taxids):
> >>>
> >>>    // first do a quick pattern-match to extract the taxid so we can
> >>>    // exit early without the overhead of parsing the whole file
> >>>    private final Pattern taxidPattern = Pattern.compile("db_xref=\\\"taxon:(\\d+)");
> >>>    Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
> >>>    if (taxidMatcher.find()) {
> >>>        def taxid = taxidMatcher[0][1].toInteger()
> >>>        if (!taxidList.contains(taxid)) {
> >>>            return
> >>>        }
> >>>    // here do the slow part of actually parsing all the features
> >>>
> >>>
> >>> This is in Groovy so there are a few syntactical differences. If you
> >>> are only interested in a subset of the GenBank records, then this
> >>> approach might be of use.
> >>>
> >>> M
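
Since the snippet above is Groovy, here is a rough plain-Java equivalent of the same pre-filter idea. It is a sketch, not the original author's code: the class and method names are made up for illustration, and it assumes that a record with no `taxon` cross-reference should be skipped, which the original fragment does not spell out.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxidPreFilter {

    // Matches the taxon cross-reference of a GenBank source feature,
    // e.g. /db_xref="taxon:9606", and captures the numeric taxid.
    private static final Pattern TAXID_PATTERN =
            Pattern.compile("db_xref=\"taxon:(\\d+)");

    /** Returns the first taxid found in the raw record text, if any. */
    public static Optional<Integer> extractTaxid(String rawRecord) {
        Matcher m = TAXID_PATTERN.matcher(rawRecord);
        return m.find()
                ? Optional.of(Integer.parseInt(m.group(1)))
                : Optional.empty();
    }

    /**
     * True if the record is worth handing to the full (slow) parser.
     * Assumption: records without a taxid are skipped.
     */
    public static boolean shouldParse(String rawRecord, Set<Integer> wantedTaxids) {
        return extractTaxid(rawRecord).map(wantedTaxids::contains).orElse(false);
    }
}
```

The caller would read each record as a string, call `shouldParse`, and only run the full feature parse on records that pass.
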
> >>>
> >>>
> >>>
> >>>
> >>> On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I am developing an app where features are extracted from a large
> >>>> genbank file, and processed: multiple alignment, annotation....
> >>>>
> >>>> The feature extraction is a real bottleneck in my app. It consumes 87%
> >>>> of total execution time.
> >>>>
> >>>> Feature extraction is done via:
> >>>>
> >>>> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
> >>>> FeatureHolder fh = richSequence.filter(ff);
> >>>> Feature feat = fh.features().next();
> >>>> ...
> >>>>
> >>>> Any suggestions on how to improve the performance of feature
> >>>> extraction are welcome.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> khalil
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
> >>>>
> >>
> >>
> >>
>
>



