[Biojava-l] literature references

Bill Bug BillB@doubletwist.com
Tue, 30 May 2000 10:55:33 -0700


Hi All,

You might want to check out the new XML DTD coming from the NIH's National
Library of Medicine.

http://www.nlm.nih.gov/bsd/licensee.html

This DTD is pretty mature, having been developed for PubMed starting a few
years back.

In reference to the complexity that Mr. Down referred to:
	1) Reference Types: Defining a reference 'type' can get quite
complicated.  Types are particularly important in electronic representations
for published information, since these types will often be used to guide the
automated processing of the record.  I have found it best to avoid the
typical method of defining an ever expanding, flat list of types as in Lout
- Book, Proceedings, PhDThesis, TechReport, MastersThesis, Misc, Article,
InBook, InProceedings.  Instead of this approach, a reference type is
defined as a bit-mask of attributes - IS_A_THESIS, IS_A_MEETING, IS_A_BOOK,
etc.  The overall 'type' of any reference is simply the bitmapped value of
all its attributes.  When you need a new attribute - like
IS_ELECTRONIC_ONLY - you simply add a new bit position.  Taking this
approach, what would be a long list of types collapses to a much shorter
list of properties.  As new types emerge that are simply a new 'combination'
of existing properties - IS_A_MEETING & IS_ELECTRONIC_ONLY - this
combination can be represented without any changes to the underlying
infrastructure.  As new types emerge that include new attributes, you simply
add the required attribute to the list.  The other benefit of this approach
is that you get implied 'relationships' between types, based on shared
attributes.  For instance - MastersThesis & PhDThesis are automatically
associated via their shared attribute IS_A_THESIS.  A flat-list description
of such types does not implicitly encode this relationship.  Finally, if
references are encoded in a hierarchical form - as is done to some extent in
the MedLine DTD - then a type that IS_A_CHAPTER in a book gains its
'bookishness' from its relationship to a parent reference that IS_A_BOOK.
If you don't adopt this hierarchical approach, you end up flattening out
these hierarchical relationships as well, thus necessitating writing code
later to 'separate' these attributes.
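
As a rough illustration of the bit-mask idea in point 1, here is a minimal
Java sketch.  The class name, the helper methods and the IS_DOCTORAL
attribute are my own assumptions for the example, not part of any existing
BioJava or MedLine API.  Note that a combination of attributes is formed
with bitwise OR, and that a shared attribute is detected with bitwise AND.

```java
// Hypothetical sketch: ReferenceTypeBits, has(), related() and IS_DOCTORAL
// are illustrative names, not an existing API.
final class ReferenceTypeBits {
    // One bit position per attribute.
    static final int IS_A_BOOK          = 1 << 0;
    static final int IS_A_THESIS        = 1 << 1;
    static final int IS_A_MEETING       = 1 << 2;
    static final int IS_A_CHAPTER       = 1 << 3;
    static final int IS_ELECTRONIC_ONLY = 1 << 4; // a later addition: just a new bit
    static final int IS_DOCTORAL        = 1 << 5; // assumed, to tell the two theses apart

    // 'Types' are combinations of attributes, not entries in a flat list.
    static final int MASTERS_THESIS     = IS_A_THESIS;
    static final int PHD_THESIS         = IS_A_THESIS | IS_DOCTORAL;
    static final int ELECTRONIC_MEETING = IS_A_MEETING | IS_ELECTRONIC_ONLY;

    /** True if every bit of the attribute is set in the type. */
    static boolean has(int type, int attribute) {
        return (type & attribute) == attribute;
    }

    /** True if the two types share at least one attribute. */
    static boolean related(int a, int b) {
        return (a & b) != 0;
    }

    public static void main(String[] args) {
        System.out.println(has(PHD_THESIS, IS_A_THESIS));            // true
        System.out.println(related(MASTERS_THESIS, PHD_THESIS));     // true
        System.out.println(related(ELECTRONIC_MEETING, PHD_THESIS)); // false
    }
}
```

An EnumSet would give the same behaviour with more type safety; plain int
flags are used here only because they match the description most directly.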

	2) General Granularity Issues: It is important to include a certain
amount of redundancy in the record structure for references.  As one of many
examples, take the proposed AuthorNames property.  It's very useful to have
the required granularity in this property so that individual author names
can be separated.  It is also very important to distinguish the individual
name components - first, last, middle, patronym, title, etc. - so as to be
able to perform such tasks as 'Return all references by author X'.
Unfortunately, the many sources of publication information often use
different schemes for representing author names.  The problem becomes even
more complex when you have to process names from various languages.  The
best solution I've found to this problem is to have the ability to store
AuthorNames in an unparsed form, so that one is not precluded from importing
a reference, if one has yet to code a parser for the AuthorName field.  This
is especially useful in a project such as biojava, as the many contributors
to the project can write the required parsers as the need arises.  The
parsers would all work against the same 'unparsed' field and can be driven
by additional attributes such as - in the case of AuthorNames - 'language',
'includes full first names', 'includes FI & MI', 'initials follow last
name', etc.  This principle extends to nearly every component of Reference,
including Citation information (Journal, Volume, Issue|Number, pagination),
date of publication, title, abstract, keywords, etc.
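
To make point 2 concrete, here is a small Java sketch of an AuthorNames
value that always keeps the unparsed string and is parsed on demand, with
the parser chosen by an attribute.  All class, field and method names are
assumptions for the example, and the only format handled is a
comma-separated list where the initials follow the last name.  The sample
names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from an existing API.
final class AuthorNamesSketch {
    /** One author split into the two components this parser understands. */
    static final class ParsedAuthor {
        final String surname;
        final String initials;

        ParsedAuthor(String surname, String initials) {
            this.surname = surname;
            this.initials = initials;
        }
    }

    /** The raw author string plus one attribute describing its layout. */
    static final class AuthorNames {
        final String unparsed;
        final boolean initialsFollowLastName;

        AuthorNames(String unparsed, boolean initialsFollowLastName) {
            this.unparsed = unparsed;
            this.initialsFollowLastName = initialsFollowLastName;
        }
    }

    /**
     * Parses "Surname Initials, Surname Initials".  For any other layout it
     * returns an empty list; the unparsed string is still available, so the
     * reference can be imported before a parser exists for its format.
     */
    static List<ParsedAuthor> parse(AuthorNames names) {
        List<ParsedAuthor> result = new ArrayList<>();
        if (!names.initialsFollowLastName) {
            return result;
        }
        for (String part : names.unparsed.split(",")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int lastSpace = trimmed.lastIndexOf(' ');
            if (lastSpace < 0) {
                result.add(new ParsedAuthor(trimmed, ""));
            } else {
                result.add(new ParsedAuthor(trimmed.substring(0, lastSpace),
                                            trimmed.substring(lastSpace + 1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AuthorNames names = new AuthorNames("Smith JA, van Dyke B", true);
        for (ParsedAuthor a : parse(names)) {
            System.out.println(a.surname + " / " + a.initials);
        }
    }
}
```

Further parsers (other languages, full first names, and so on) would be
added as extra branches or separate classes, all reading the same unparsed
field.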

	3) I completely concur with Mr. Down's suggestion that XML be a part
of the design consideration from the outset.  The publishing industry has
learned the hard way - i.e., at great *cost* - that manipulation of this
info in electronic form must be done in as structured an environment as
possible.  SGML was the result of this recognition - though it became quite
complex, due to the complex nature of the information representation
required and the tendency of different publishers to *push* their own
standard.  It was only the dawning of the web - and the recognition of the
mountains of money to be made by fully-automating the process of re-purposing
publication information - that has brought some order to this system in the
form of the various XML projects - XML, XSL, XLL, etc.  Enough
editorializing, I cast my vote for the biojava implementation being
XML-aware.  The only difficulty - and it's not a triviality, as the biojava
persistence discussions made clear - is interfacing to the various DTDs
and/or object representations currently in use.  This is where the MedLine
XML DTD may be of use, as it will cover a broad swath of the life science
literature - though by no means *all* of the relevant publications.  For
meeting citations, patents, books, theses, etc., one would need to go
elsewhere.

This is definitely an important complement to the wonderful work you've been
doing with references to sequence & structural data.

Best of luck!

Cheers,
Bill Bug
Knowledge Schema Engineer

billb@doubletwist.com
510 587 5781

DoubleTwist, Inc.
1100 Harrison Street
Suite 1100
Oakland, CA 94612



-----Original Message-----
From: Thomas Down [mailto:td2@sanger.ac.uk]
Sent: Tuesday, May 30, 2000 9:54 AM
To: Matthew Pocock
Cc: biojava-l@biojava.org
Subject: Re: [Biojava-l] references


Okay, I'm actually just across the office from Matt, but I'm posting
to the list to kick the discussion off.

I'm also keen to see Reference objects in BioJava (is this such
a good name?  We already have java.lang.ref.Reference and
javax.naming.Reference.  But no matter).  The real issue behind
references is that they're rather complicated.  My main experience
of bibliographical databases is Lout's @Reference objects. (For
those who've never seen Lout, it's like LaTeX, but with simpler syntax.
See http://snark.ptc.spbu.ru/~uwe/lout/lout.html)

Anyway, Lout references have a /lot/ of possible properties.  One
of the properties is an @Type string, which indicates which of the
other properties are used, and how the reference should be printed.  Valid
types are

  Book, Proceedings, PhDThesis, TechReport, MastersThesis, Misc,
Article, InBook, InProceedings

(and I think that since this system was designed, things have got
even worse.  For instance, there certainly ought to be a WebSite
type, to use when referencing BioJava :).

There are several options that could be taken:

  - Have a baroque reference object with lots of properties
    (modelled on Lout?).  This would actually work quite well,
    but it makes me rather uneasy...

  - Model reference types on the basis of polymorphism, e.g.
    BookReference, ArticleReference, WebSiteReference.  This might
    be quite hard to use in practice, though...

  - Have a `core reference' interface containing only slightly
    more than what Matt suggested, then add extra fields on a
    tag-value basis.

At the moment, my feeling is that option 3 might be
the easiest route to sanity, but it's certainly something that's
worth discussing.
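
For the sake of discussion, option 3 might look roughly like the sketch
below: a small fixed core plus free-form tag-value properties.  The
interface, class and method names are hypothetical, and the choice of
which fields belong in the core is only an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of option 3; not an agreed BioJava interface.
final class TagValueReferenceSketch {
    /** A small fixed core; everything else is reached through tags. */
    interface CoreReference {
        String getJournal();
        int getVolume();

        /** Returns the value stored under the tag, or null if there is none. */
        String getProperty(String tag);

        void setProperty(String tag, String value);
    }

    static final class SimpleReference implements CoreReference {
        private final String journal;
        private final int volume;
        // LinkedHashMap keeps the tags in insertion order.
        private final Map<String, String> extras = new LinkedHashMap<>();

        SimpleReference(String journal, int volume) {
            this.journal = journal;
            this.volume = volume;
        }

        public String getJournal() { return journal; }
        public int getVolume() { return volume; }
        public String getProperty(String tag) { return extras.get(tag); }
        public void setProperty(String tag, String value) { extras.put(tag, value); }
    }

    public static void main(String[] args) {
        CoreReference ref = new SimpleReference("Example Journal", 12);
        ref.setProperty("url", "http://example.org/");
        System.out.println(ref.getProperty("url"));
        System.out.println(ref.getProperty("isbn")); // null: tag was never set
    }
}
```

The trade-off is that the extra fields are untyped strings, so any
validation has to live in the code that reads the tags.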

Other interesting question: I agree (at least in principle, might
want to thrash the details slightly) with the idea of modelling
Authors as their own class.  But what about other cases?  Should
a web-site URL be a String or a java.net.URL, or what?  How about
Journal objects (which might be quite helpful in some cases, but
could also really get in the way at other times).

We've already had a bit of talk of persistence on the list,
and References are a case where this is really worth thinking 
about right from day one.  I'd like to see whatever type of
Reference objects we agree on having a nice way of storing them
in (at least) XML format.
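
As a starting point for the XML side, here is a minimal sketch that writes
a few reference fields out with the standard javax.xml DOM and transformer
classes.  The element names (reference, journal, volume, startPage,
endPage) are invented for the example and do not follow any particular
DTD.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: the class name and element names are assumptions.
final class ReferenceXmlSketch {
    /** Builds a small DOM tree and serializes it to a string. */
    static String toXml(String journal, int volume, int startPage, int endPage) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element ref = doc.createElement("reference");
            doc.appendChild(ref);
            addChild(doc, ref, "journal", journal);
            addChild(doc, ref, "volume", String.valueOf(volume));
            addChild(doc, ref, "startPage", String.valueOf(startPage));
            addChild(doc, ref, "endPage", String.valueOf(endPage));

            // An identity transform copies the DOM tree to text,
            // escaping characters such as '&' along the way.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    private static void addChild(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        System.out.println(toXml("Example Journal", 12, 100, 110));
    }
}
```

Going through a DOM rather than string concatenation means escaping is
handled by the library, which matters for titles and journal names.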


Any more thoughts,

    Thomas.

On Tue, May 30, 2000 at 05:29:14PM +0100, Matthew Pocock wrote:
> Dear all,
> 
> References to literature are things that need representing in EMBL,
> GENBANK, SwissProt entries (and almost certainly in other contexts). It
> would be nice if we could supply a Reference interface under
> org.biojava.utils - what should the interface contain?
> 
> interface Reference {
>   // a List of Author objects
>   List getAuthors();
> 
>   String getJournal();
>   int getVolume();
>   int getStartPage();
>   int getEndPage();
> }
> 
> interface Author {
>   char [] getInitials();
>   String getSurname();
> }
> 
> What is missing? What shouldn't be here? Is it needed at all?

-- 
There are whose study is of smells
And to attentive schools rehearse
How something mixed with something else
Makes something worse.

_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l