[Biojava-l] Re: Proposed addition to the SequenceDB interface

Thomas Down td2@sanger.ac.uk
Sun, 17 Mar 2002 22:04:08 +0000


On Fri, Mar 15, 2002 at 01:51:07PM -0500, Marc Colosimo wrote:
> Thomas Down <td2@sanger.ac.uk> wrote:
> 
> > Hi...
> >
> > I'm considering adding a filter(FeatureFilter); method to
> > SequenceDB, which allows features to be extracted from a
> > whole database, rather than just a single sequence.  Typical
> > usage would be:
> >
> >   SequenceDB seqDB = ...
> >   FeatureHolder mygene = seqDB.filter(
> >       new FeatureFilter.ByAnnotation("gene.id", "BRCA2")
> >   );
> 
> Would this return the feature as some sort of generic gene.id feature? My
> growing concern is that for each file/db/SQL format we are adding features with
> their original names rather than some defined, BioJava-enforced feature name. I
> noticed a DTD for features. Unfortunately, I don't know much about XML besides
> the simple things. Could we define standard names such as gene_id, accession_no,
> etc.? With such a set of names, you wouldn't have to know what the gene_id tag
> is called in EMBL, GenBank, SQL, ...
> 
> Or have I missed this ability in BioJava somehow?

No, your concern is quite justified.  It is, indeed, necessary to
have some specialized knowledge about a particular data source before
you can really make use of the tag-value data present in the
Annotation bundles.

I think a set of `common' key names would be a big help, and I'd welcome
any proposals for what should be in it (the standard set of feature
types and qualifiers from EMBL might be a good starting point, but
probably not a complete solution).  I'd also like to be able to introspect,
for a given database, what properties I should expect to find on features.
The AnnotationType objects, written recently by Matthew, ought to be one
part of the puzzle.
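
To make the `common' key idea concrete, here is a minimal sketch of one way
source-specific tags could be translated into a shared vocabulary. Everything
in it is an assumption for illustration: CommonKeys, addAlias and normalize
are hypothetical names, not BioJava API, and plain Maps stand in for
Annotation bundles.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of BioJava): rewrites source-specific
// annotation keys into an agreed set of common key names.
class CommonKeys {
    // Maps "source:sourceKey" to the common key name.
    private final Map<String, String> aliases = new HashMap<>();

    // Register that, for the given source, sourceKey means commonKey.
    void addAlias(String source, String sourceKey, String commonKey) {
        aliases.put(source + ":" + sourceKey, commonKey);
    }

    // Return a copy of the annotation with known keys renamed.
    // Keys without a registered alias are passed through unchanged.
    Map<String, Object> normalize(String source, Map<String, Object> annotation) {
        Map<String, Object> out = new HashMap<>();
        for (Map.Entry<String, Object> e : annotation.entrySet()) {
            String common = aliases.get(source + ":" + e.getKey());
            out.put(common != null ? common : e.getKey(), e.getValue());
        }
        return out;
    }
}
```

With an alias registered from "gene.id" to "gene_id" for some source, a
filter could be written against gene_id alone, whatever the original tag was.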

Even before this problem is solved, the filter-all-features-in-a-database
operator still seems to me to be useful -- and I can't see any way in
which it should make improved standardization and `introspectability' harder
in the future.  Or am I missing something?
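
For illustration, the semantics of the proposed operator can be sketched
without BioJava at all. The types below (ToyFeature, ToySequenceDB) are
stand-ins invented for this example, and byAnnotation only mimics the role of
FeatureFilter.ByAnnotation; the real method would return a FeatureHolder
rather than a List.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Stand-in for a feature: the ID of its parent sequence plus tag-value data.
class ToyFeature {
    final String seqId;
    final Map<String, String> annotation;

    ToyFeature(String seqId, Map<String, String> annotation) {
        this.seqId = seqId;
        this.annotation = annotation;
    }
}

// Stand-in for a sequence database holding features grouped by sequence.
class ToySequenceDB {
    private final Map<String, List<ToyFeature>> featuresBySeq = new LinkedHashMap<>();

    void addFeature(ToyFeature f) {
        featuresBySeq.computeIfAbsent(f.seqId, k -> new ArrayList<>()).add(f);
    }

    // Database-wide filter: visit every sequence and collect the features
    // accepted by the predicate, in insertion order.
    List<ToyFeature> filter(Predicate<ToyFeature> filter) {
        List<ToyFeature> hits = new ArrayList<>();
        for (List<ToyFeature> features : featuresBySeq.values()) {
            for (ToyFeature f : features) {
                if (filter.test(f)) {
                    hits.add(f);
                }
            }
        }
        return hits;
    }

    // Accepts features whose annotation has exactly this value for the key.
    static Predicate<ToyFeature> byAnnotation(String key, String value) {
        return f -> value.equals(f.annotation.get(key));
    }
}
```

This naive version scans every sequence; a database-backed implementation
could presumably push the filter down into a query instead.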

    Thomas.