[Biojava-l] Languages

Armin Groll ArminG9@compuserve.de
Sat, 20 Oct 2001 17:18:31 +0200


Matthew Pocock wrote:

> 
> Hi.
> 
> On a related note to the languages, should we write a SymbolList-centric
> regexp package? It should not be too hard to do. Do people do many
> regexp searches over DNA strings?
> 

I think so. Given a cDNA or RNA sequence, biologists do this regularly,
for example when searching for TATA boxes and open reading frames,
although the regexps can get very large in certain circumstances.
But I also think that a quick version of this is not a classic
ten-line piece of code. As I understand it, regular expression pattern
matching is the same as scanning (JFlex). It is not enough to create a
deterministic finite automaton from a regexp object, because there are
also algorithms that, depending on the pattern of the regexp, can match
with a cost that grows less than proportionally to the sequence length.
And sure, if you work on this, you want to end up with something really
fast.
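
To illustrate the point about matching with less work than one step per
sequence position, here is a minimal sketch of the Boyer-Moore-Horspool
technique. It is only an illustration chosen for this note, not
something from BioJava or JFlex. It handles a fixed motif rather than a
full regexp, and it works on a plain ASCII String instead of a
SymbolList. The names MotifScan and find are hypothetical.

```java
import java.util.Arrays;

class MotifScan {
    /**
     * Returns the index of the first occurrence of motif in seq, or -1.
     * Assumes both strings contain only plain ASCII symbols (A, C, G, T, ...).
     */
    static int find(String seq, String motif) {
        int n = seq.length();
        int m = motif.length();
        if (m == 0) {
            return 0;
        }
        // For each symbol: how far the window may jump when that symbol
        // is the last one in the current window. Default is the full
        // motif length (symbol does not occur in the motif prefix).
        int[] shift = new int[128];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[motif.charAt(i)] = m - 1 - i;
        }
        int pos = 0;
        while (pos + m <= n) {
            // Compare the window against the motif from right to left.
            int j = m - 1;
            while (j >= 0 && seq.charAt(pos + j) == motif.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Jump ahead; positions in between are never inspected.
            pos += shift[seq.charAt(pos + m - 1)];
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("GGCTATAAAGG", "TATAAA"));
    }
}
```

How much is skipped depends on the pattern, as said above: a long motif
whose symbols are rare in the sequence allows large jumps, while a short
or repetitive one falls back to roughly one step per position.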

> M
> 
> ps Armin, are cytogenetic loci identical to sequences of DNA, or are
> they labels for regions of these sequences?

I hope I understand the question correctly. If so, then in a sense they
are real pieces of DNA, and thus sequences.
First the basics again (to be complete):
'1' is the whole of chromosome one as seen under the microscope, meaning
the sequence and its complement.
'1p' is the short arm of '1', followed by the centromere '1cen' and
then the long arm '1q'. So '1' is the sequence of '1p', followed by
'1cen', followed by '1q' (or the reverse). A '1' contains no more and no
less than that. The same subdivision continues further down.

Next come the regions ('1pter', '1p3', '1p2', ...). These regions are
defined by alternating dark and light areas (so, for example, if 1p1 is
dark, 1p2 is light). This continues downward (1p1 -> 1p11, 1p12, ...):
if you look closely enough through the microscope, you can see the dark
and light areas splitting up again into finer alternating dark and
light areas.
But it would be a real problem if at some point someone actually tried
to map the cytogenetic loci to true DNA sequences. The nature of the
dark and light areas is not really understood (as far as I know). They
may vary a little in size (or in the range they cover on the real
sequence) over time and/or between individuals (so the borders are
fuzzy), and surely every organism has different sequences in there.
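
The nesting of the labels themselves can be modelled without any
sequence coordinates. Here is a minimal sketch, assuming only the
naming scheme described above (chromosome, then 'p', 'q' or 'cen', then
region and sub-band digits). The class CytoLocus and its contains
method are hypothetical names, not BioJava API. Because the borders are
fuzzy, the sketch deliberately stores no positions on the DNA.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CytoLocus {
    // Chromosome (digits, X or Y), optional arm (p, q or cen),
    // optional band suffix such as "1", "11" or "ter".
    private static final Pattern LABEL =
            Pattern.compile("(\\d+|X|Y)(cen|p|q)?(.*)");

    final String chromosome;
    final String arm;   // "" means the whole chromosome
    final String band;  // "" means the whole arm

    CytoLocus(String label) {
        Matcher m = LABEL.matcher(label);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised locus: " + label);
        }
        String a = m.group(2) == null ? "" : m.group(2);
        String b = m.group(3);
        if (a.isEmpty() && !b.isEmpty()) {
            throw new IllegalArgumentException("Unrecognised locus: " + label);
        }
        chromosome = m.group(1);
        arm = a;
        band = b;
    }

    /** True if the other locus is this locus or lies inside it. */
    boolean contains(CytoLocus other) {
        if (!chromosome.equals(other.chromosome)) {
            return false;
        }
        if (arm.isEmpty()) {
            return true; // a whole chromosome contains all of its parts
        }
        // Same arm, and the other band label refines this one (1p1 -> 1p11).
        return arm.equals(other.arm) && other.band.startsWith(band);
    }
}
```

With this, new CytoLocus("1p1").contains(new CytoLocus("1p12")) is true,
while '1p1' does not contain '1p2', and '1' does not contain '11p1'.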

Regards,

Armin