[EMBOSS] Multiplatform filenames (was Re: Masking the : character?)

José R. Valverde jrvalverde at cnb.uam.es
Mon Jun 20 08:55:20 UTC 2005


On Sat, 18 Jun 2005 17:28:16 +0800
<yezhiqiang at gmail.com> wrote:
> I have also found this.
> and \:  or using quote cannot solve this problem.
> 
> But why not just rename your file name? It doesn't bother.
> 
> 
> 2005/6/17, Martin Sarachu <msarachu at biol.unlp.edu.ar>:
> > Dear list,
> > 
> > is there any way to mask the ':' character so it is not interpreted as a
> > delimiter for DB:sequence?

	Renaming. 
	---------

Or in other words (caution, detailed explanation follows):

    Why should anybody have a database or db. file named something\ or 
something\\\?

But the fact is that Unix filesystem semantics allow it. So there is
no easy way to avoid the ':' problem, as one must accommodate such
names, especially since '::' is also meaningful to EMBOSS. One would have
to introduce a special escape metacharacter or a quotation method and,
while at it, make it integrate easily with shells... meaning that it
should survive pre-processing by the shell (e.g. 'file:name' would come
out of the shell as file:name, so the user would need to type
"'file:name'" or some other such horrible combination to escape shell
quotation too).

The problem arises because the ':' is used for historic reasons as a
carry-over from VMS, where it had special meaning in pathnames. This
does not hold on UNIX, where it is a legit character (actually ANY char
but '/' and NUL is a legit character on UNIX). This is important as
EMBOSS may be used in many locales, and you don't know in advance
how a given symbol will be represented in them. Freedom comes at a
cost.

QUICK SOLUTION
--------------
I think that for the user it is simpler to know that ':' has a special
meaning and should be avoided.

For the cases where the colon is generated automatically, it may be better
to provide a renaming script that changes the colon to something else.
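
A renaming script of that kind could be as small as the following sketch.
It is only an illustration: Python, the function name rename_colons and
the default '_' replacement are assumptions of this example, not anything
EMBOSS provides.

```python
import os


def rename_colons(directory, replacement="_"):
    """Rename every entry in `directory` whose name contains ':'.

    Returns a list of (old_name, new_name) pairs. Entries whose new
    name already exists are skipped rather than overwritten.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        if ":" not in name:
            continue
        new_name = name.replace(":", replacement)
        src = os.path.join(directory, name)
        dst = os.path.join(directory, new_name)
        if os.path.lexists(dst):
            continue  # never clobber an existing file
        os.rename(src, dst)
        renamed.append((name, new_name))
    return renamed
```

Calling rename_colons(".") in the directory holding the generated files
would rename them in place and report what was changed.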


UI 'PRO' APPROACH
-----------------
For GUI writers it is probably better to "translate" any such filenames
between the user and EMBOSS. Note the quotes around translate above: it
is not immediate. Let me explain:

	Escaping for the *command line* must be done using some character 
that is a) meaningful (but those are mostly already taken) and b) easy 
to type on a keyboard. In any case, this means that the user must be aware
of the special case, and if so, renaming is just as good a solution.

	Escaping for the GUI removes all conditions and gives you full
freedom. There are useful tricks to use special quoting/escaping chars
on GUIs (hint: look into ASCII 0-31), but translating filenames can NOT
be done transparently to the user (unless you can guarantee yours is
the only user interface they will use). Any translation will change
the filename and make it look different or even untypable on other
interfaces.

	Note that the problem still remains of telling automatically when
a pathname containing a colon is an actual filename and not a
database:file specification. On a GUI you may assume a :-containing path
is a filename when you are tagging uploaded data or program generated
data, but otherwise you should be cautious, highly cautious. I.e. does
swiss:prot_human refer to the database entry or to the data the user
uploaded and called that way? Is it possible someone has called their
database 'sequencer_files' locally, and if so, how do you distinguish the
local database of sequencer files from the user batch of sequencer_files:*
uploaded sequences?

	Assuming you can tell, then read on:

	The trick is to create a special hidden directory in each user
directory accessed: e.g. .myGUI-names. Then for every file make a
suitably processed symlink in that subdirectory and call EMBOSS through
the symlink, sort of:

	my-gui-store-file(filename)
	{
		save(filename);
		// process() works on a copy: filename itself stays untouched
		sym = concatenate(".myGUI-names/", process(filename));
		make_symlink(filename, sym);	// sym points to filename
	}

	my-gui-emboss-access-file(filename)
	{
		sym = concatenate(".myGUI-names/", process(filename));
		if (!file_exists(sym))
			make_symlink(filename, sym);
		emboss-access(sym);
	}

	process(filename)
	{
		copy = duplicate(filename);
		for (p = copy; *p; p++)
			if (*p == ':')
				*p = SUB; // e.g. ASCII 0x1A
		return copy;
	}

And off you go. Why the <SUB>? You should try to substitute the colon by
something that is guaranteed to be portable. You only have either a) the
portable character set (which is all typable) or b) the control character
set (ASCII 0-31), which you may assume will be available everywhere and
most probably not used in filenames, as these characters are very
difficult to type or use by hand in general. From these we had better
avoid NUL, BEL, BS, HT, LF, VT, FF, CR and ESC just in case. But we still
have plenty to choose from: SUB (substitute), CAN (cancel) and DLE (data
link escape) have good mnemonics for escaping, and STX (start of text)
and ETX (end of text) for quoting, but these are only suggestions.
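
For those who prefer something runnable, the pseudocode above might be
sketched in Python roughly as follows. This is an illustration only: the
function name link_for is made up for the example, the link target is
relative (my choice, so the links survive if the directory is moved), and
the save() and emboss-access() steps are left to the GUI.

```python
import os

SUB = "\x1a"               # ASCII SUB, the substitute suggested above
LINK_DIR = ".myGUI-names"  # hidden directory name taken from the example


def process(filename):
    """Return a copy of filename with every ':' replaced by SUB."""
    return filename.replace(":", SUB)


def link_for(directory, filename):
    """Ensure a colon-free symlink to directory/filename exists.

    Returns the path of the symlink, which is what the GUI would pass
    on instead of the original name.
    """
    link_dir = os.path.join(directory, LINK_DIR)
    os.makedirs(link_dir, exist_ok=True)
    sym = os.path.join(link_dir, process(filename))
    if not os.path.lexists(sym):
        # Relative target: ../<original name>, resolved from LINK_DIR.
        os.symlink(os.path.join(os.pardir, filename), sym)
    return sym
```

Note that only the last path component is processed here; the directory
part of the path has to be free of colons as well.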

That is to say: in the example above we substituted : by <SUB>, because
we only care about this special case. If there were more cases, then full
escaping/quoting might be needed, and then instead we would copy the
filename into a new string and fully quote/escape. 

I suggest the substitution approach since we are doing the encoding *within*
the file name: anything else (quoting/escaping) will introduce additional
chars inside the filename, and this will reduce the available filename
length, hence making it less transparent and potentially dangerous (should
there by any chance be two filenames at the length limit containing an
escapable sequence and differing only in the last char).
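
The length argument can be made concrete with a small sketch. Everything
in it is hypothetical: the escaping scheme (DLE followed by a letter), the
255-character limit and the silent truncation are assumptions chosen only
to show how two distinct names could end up colliding.

```python
SUB = "\x1a"  # ASCII SUB
DLE = "\x10"  # ASCII DLE, used here as a hypothetical escape character


def substitute(name):
    """One-for-one replacement: the length never changes."""
    return name.replace(":", SUB)


def escape(name):
    """Hypothetical escaping: DLE -> DLE DLE, ':' -> DLE 'c'.

    Every escaped character makes the name one character longer.
    """
    return name.replace(DLE, DLE + DLE).replace(":", DLE + "c")


def truncate(name, limit=255):
    """Mimic a system that silently cuts names at `limit` characters."""
    return name[:limit]


# Two names at the assumed limit, differing only in the last character.
a = "x:" + "y" * 252 + "1"
b = "x:" + "y" * 252 + "2"
```

With these two names, substitute() keeps both at 255 characters and
distinct, whereas escape() pushes both to 256, so that after truncate()
they can no longer be told apart.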

Alternately one may use a hash of the filename instead, but this is more
painful to code, maintain and debug and potentially more wasteful in terms
of space.
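
For comparison, a hash-based variant might look like the sketch below.
The choice of SHA-1 and the names hashed_name and build_index are
assumptions of this example. Since a hash cannot be reversed, an index
back to the original names has to be kept somewhere, which is part of
the extra maintenance burden.

```python
import hashlib


def hashed_name(filename):
    """Return a fixed-length, colon-free link name derived from filename."""
    return hashlib.sha1(filename.encode("utf-8")).hexdigest()


def build_index(filenames):
    """Map each hashed link name back to its original filename."""
    return {hashed_name(name): name for name in filenames}
```

Each link name is 40 hexadecimal characters regardless of how short the
original was, and it tells a human nothing about the file it stands for.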

Now, the original filenames are in place, and available for the command
line, up/downloads, other user interfaces, etc. to manage as they wish,
but your GUI is no longer haunted by the infamous colon.

Symlinks on UNIX eat very little space: usually just the directory
entry. If space is very tight and becomes a concern you may consider
either hardlinks or only symlinking special filenames (this last at
the cost of additionally complex logic). With current hard disks I
wouldn't worry.

And, yes, I know this involves many more changes to a UI, but either
users accommodate (by avoiding the colon) or the UI does (by hiding
limitations).

Actually, a similar trick is used by NetATalk, AppleTalk, MacOS X
and other systems that have similar metadata problems.

				j
