this format is not readable by seqret

Tue Oct 15 12:09:12 UTC 2002

On Tue, 15 Oct 2002 11:44:28 +0200
Joerg Muehlisch <jmuehlis at uni-muenster.de> wrote:

> Hi,
> 
> in fact I hoped that somebody on the list would know where this format
> comes from. In my file sample I just found some of these unreadable
> sequences.
> As it does not seem to be a well-known format, I will try to find out
> where it is used.
> 
Maybe it would help if you could post a full file sample.
From the fragments you posted, it looked like a sequencing project
file. It mentioned a contig size, a number of gel readings with their
average length, and the orientation coverage of the gels (+/- strands).

Iff the sequence contained (you only included a few bases) is just
the consensus, i.e. a single sequence whose length exactly equals the
consensus length, then conversion to any format should be trivial.
Simply do a 'tail -n +8 file' to drop the header lines.
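
A minimal sketch of that consensus-only case, written with the portable
'tail -n +8' form. The sample file, its seven header lines, and the
">consensus" label are made up for illustration; the real header length
has to be checked against the actual file.

```shell
#!/bin/sh
# Build a hypothetical input: 7 header lines followed by the sequence.
tmp=$(mktemp -d)
{
  i=1
  while [ "$i" -le 7 ]; do
    echo "header line $i"
    i=$((i + 1))
  done
  echo "ACGTACGTAC"
  echo "GGTTAACC"
} > "$tmp/consensus.txt"

# tail -n +8 prints from line 8 onward, i.e. it drops the first 7 lines.
# Prepending a ">" title line gives a plain FASTA file.
{
  echo ">consensus"
  tail -n +8 "$tmp/consensus.txt"
} > "$tmp/consensus.fasta"

cat "$tmp/consensus.fasta"
```

The resulting file is ordinary FASTA, which seqret can read.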

Otherwise it might contain the gel readings (and the consensus?),
and then it would be a multiple sequence file, possibly with gel
overlaps etc., and conversion may be a bit more difficult. It may
also be that more than one contig, each with its associated files, is
included in one file, making processing more difficult.

From the header, I would initially expect the second case to be true:
several short sequences making up a contig, possibly plus the consensus.
In your example, the first contig would be 506 bases, composed of three
gels of average length 458. Since 1375/3 = 458, I deduce that the
consensus sequence is not included. Therefore you have a multiple
sequence file of overlapping gel readings.
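
The arithmetic behind that deduction can be checked quickly. This is
only a sketch: it assumes that 1375 is a total base count reported in
the header, and the variable names are mine, not part of the format.

```shell
#!/bin/sh
# Figures taken from the posted header fragment.
contig_len=506     # contig length
n_gels=3           # number of gel readings
total_bases=1375   # ASSUMPTION: total number of bases listed in the header

# Integer division, as in the message: 1375 / 3 = 458
avg_len=$((total_bases / n_gels))
# Bases explained by the gel readings alone
covered=$((avg_len * n_gels))

echo "average gel length: $avg_len"
echo "bases accounted for by gels: $covered of $total_bases"
echo "room left for a $contig_len base consensus: $((total_bases - covered))"
```

Under that assumption the three gels use up essentially the whole total,
so a 506 base consensus cannot also be counted in it.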

You may try this:

	1) find out if more than one contig is in the file
	2) find out how sequences are separated
	3) decide what you want to do with them, e.g.
		split the file at "^Contig " lines
		strip comment lines (^*:*$)
		split at sequence separators
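
The first checks in that list can be sketched with grep. The sample
file below is invented (a "Contig" line, one "key: value" comment line,
reads separated by blank lines) and only stands in for the real format,
and treating every line with a colon as a comment is my reading of the
pattern above.

```shell
#!/bin/sh
# Hypothetical stand-in for the real file.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
Contig 1
length: 10

ACGTACGT

TTGACC
Contig 2
length: 6

GGCATT
EOF

# 1) how many contigs are in the file?
n_contigs=$(grep -c '^Contig ' "$tmp/sample.txt")
echo "contigs: $n_contigs"

# 2) how are sequences separated? here: count the blank lines
n_blank=$(grep -c '^$' "$tmp/sample.txt")
echo "blank lines: $n_blank"

# 3) how many lines remain once lines containing ':' are stripped?
n_kept=$(grep -vc ':' "$tmp/sample.txt")
echo "lines kept after stripping comments: $n_kept"
```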

See csplit(1) for details on how to do it in a pipeline. E.g.
assuming sequences are delimited by a blank line, this _might_
work (csh syntax):
	csplit -f contig. file '/^Contig /' '{*}'
	foreach i ( contig.* )
		tail -n +8 $i | csplit -f ${i}.gel. - '/^$/' '{*}'
	end
(note that a blank line is matched with ^$, and that '{*}' repeats the
pattern to the end of input in GNU csplit; other versions need an
explicit count such as {99} together with -k) and you'd get the raw
sequences as contig.##.gel.##
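
If csplit is not at hand, the first split can also be done with awk.
This swaps in a different tool, so take it as a sketch only: the sample
file and the contig.NN naming are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical two-contig sample file.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
Contig 1
ACGT

TTGA
Contig 2
GGCC
EOF

# Start a new output file at every line beginning with "Contig ";
# lines before the first such line (none here) are ignored.
awk -v dir="$tmp" '
  /^Contig / { n++; out = sprintf("%s/contig.%02d", dir, n) }
  n          { print > out }
' "$tmp/sample.txt"

ls "$tmp"/contig.*
```

Each contig.NN file can then be cut further at its own separators.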

				j