[Open-bio-l] [Biohackathon] two flatfile spec change proposals (fwd)

Jason Stajich jason@cgt.mc.duke.edu
Fri, 19 Jul 2002 09:52:15 -0400 (EDT)


Forwarded so all can see.

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

---------- Forwarded message ----------
Date: Fri, 19 Jul 2002 00:08:38 -0600
From: Andrew Dalke <adalke@mindspring.com>
Reply-To: dalke@dalkescientific.com
To: biohackathon@egenetics.com
Subject: [Biohackathon] two flatfile spec change proposals

I would like to make a couple of small changes to the current flat-file
indexing spec.

 1. =================

I want to support this style of indexing:

  mindy --create MyIndex
  mindy --update MyIndex filename1.gb
  mindy --update MyIndex filename2.gb
     ....

That is, create an empty index, then add in new files.  The format
would be determined and set automatically from the first update.

This can't quite be done with the current spec because I need to
define a 'format' in the config information.  I could think of two
alternatives.  One is to define the format when creating the empty
index, as in

  mindy --create MyIndex --format genbank
  mindy --update MyIndex filename1.gb
  mindy --update MyIndex filename2.gb

The other is to create the index and update with a file in one step,

  mindy --create MyIndex filename1.gb
  mindy --update MyIndex filename2.gb


I don't like the first option because I want to support auto format
detection.  I don't like the second because of the lack of symmetry.
It's a bit trickier and more error-prone for code to special-case the
first file.  (In other words, it feels wrong.)


I would like to change the flatfile spec slightly and want to
run it by you all first.

I came up with three possibilities:

  1.  allow the format to be unspecified, as in the empty string "".
      If unspecified, the update program is in charge of setting the
      format later on.

  2.  instead of the empty string "", use the word "sequence"

  3.  put the format definition in the 'fileids' normalization table.
      Currently this looks like

         <fileid> "\t" <filename> "\t" <file length> "\n"

      In the proposed version it becomes

         <fileid> "\t" <filename> "\t" <file length> "\t" <format> "\n"

      along with the additional constraint that all <format> fields
      must be identical.


Of these I prefer the third.
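
A minimal sketch of the third option in Python, to make the proposed
line layout and the "all <format> fields must be identical" constraint
concrete.  The function names are hypothetical and are not part of the
spec or of mindy; the sketch also assumes an index with no lines yet
simply has no format, which the proposal above does not state.

```python
# Hypothetical helpers for the proposed four-field 'fileids' line:
#   <fileid> "\t" <filename> "\t" <file length> "\t" <format> "\n"

def format_fileid_line(fileid, filename, length, fmt):
    """Build one line of the proposed 'fileids' table."""
    for field in (fileid, filename, fmt):
        # ASCII control characters (including tab and newline) would
        # break the tab-separated layout, so reject them.
        if any(ord(c) < 0x20 or ord(c) == 0x7F for c in field):
            raise ValueError("control character in field: %r" % (field,))
    return "%s\t%s\t%d\t%s\n" % (fileid, filename, length, fmt)


def parse_fileid_line(line):
    """Split one line into (fileid, filename, length, format)."""
    if not line.endswith("\n"):
        raise ValueError("line is not newline-terminated")
    fields = line[:-1].split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    fileid, filename, length, fmt = fields
    return fileid, filename, int(length), fmt


def index_format(lines):
    """Return the one format shared by every line.

    Returns None for an empty table (assumption: a freshly created
    index has no format until the first update adds a line).
    """
    formats = {parse_fileid_line(line)[3] for line in lines}
    if len(formats) > 1:
        raise ValueError("mixed formats: %s" % sorted(formats))
    return formats.pop() if formats else None
```

With this layout an update step could call index_format() on the
existing lines: None means "take the format detected from this file",
anything else means "the new file must match".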

 2. ===========================

In addition, on rereading the spec I noticed that the filename section
specifically says:

> By definition, the fileid and filename strings are not permitted to
> contain ASCII control characters so no character escape mechanism is
> needed for the two terms.  If you use a filename with non-printable
> characters, you get what you deserve.  :)

I'm worried now about internationalized filenames.  Suppose you use an
MS Windows system with cp1208 encoding, or one with UTF-8 encoding, etc.
How is the filename represented in the index file?  Always as UTF-8?  I
also don't know how well those names port across different file systems,
nor do I know if they contain bytes which can be interpreted as control
characters.  (My reading of UTF-8 says that they can't.)

I'm okay that the different names (like 'id', 'accession', 'author',
etc.) may only be stored in printable ASCII.  I'm not as happy that
I can't have the file

  /home/dalke/för_Inga/proteins.gb

because the creator of the index does not always control how the files
and directories are named.
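
The reading of UTF-8 given above can be checked with a short Python
sketch.  It tests only for ASCII control bytes (0x00-0x1F and 0x7F),
which is what the quoted spec text forbids; the helper names are
hypothetical, and nothing here settles how other encodings or file
systems behave.

```python
def has_ascii_control_bytes(data):
    """True if any byte is an ASCII control character."""
    return any(b < 0x20 or b == 0x7F for b in data)


def utf8_keeps_non_ascii_high(max_code_point=0xFFFF):
    """Check that non-ASCII code points encode to bytes >= 0x80 only.

    If that holds, a UTF-8 filename can contain a tab or newline byte
    only when the name itself contains that ASCII character.
    """
    for cp in range(0x80, max_code_point + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


# The example path from this message, encoded as UTF-8.
path = "/home/dalke/för_Inga/proteins.gb"
encoded = path.encode("utf-8")

print(has_ascii_control_bytes(encoded))
print(utf8_keeps_non_ascii_high())
```

So storing the filename as UTF-8 would not by itself collide with the
tab and newline separators; which encoding the index should require is
still the open question.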

I think the best solution for now is to add the following to the spec:

  ... If you use a filename with non-printable characters and know the
  details of Unicode, including using UTF-8 and cp1208 for filesystem
  encodings, let us know so we can fix the spec.

					Andrew
					dalke@dalkescientific.com
