New program: makeseq

Fri Sep 24 15:02:39 UTC 2004

Hi Henrikki

Sounds very useful.  I've tried to address all your points (below).

Cheers

Jon

Henrikki Almusa wrote:
> 
> Hello,
> 
> I have written a new program for emboss. The code is made against emboss-2.9.0
> and the program is called 'makeseq'. It creates random sequences, but since
> the biological world isn't quite that random, it can use either pepstats
> output (for proteins) or cusp output (for nucleotides) to create a
> distribution. This should give users the ability to create random sequences
> biased according to their own sequence triplet or amino acid distributions.
> The program also allows inserting a given sequence (insert) within the
> created sequence. However, I've encountered a few problems where I need help.
> 
> 1. Acd handling
> I've tried to make the program query something depending on other selection.
> Sequence type should be asked if there is no distribution file and start
> point of the insertion should be asked if insert has been given. I can't make
> it query the these properly properly.

The cleanest way to do this is to use a "Toggle" ACD data item.
e.g. 

toggle: distro  
[
  standard: "Y"
  information: "Do you want to use make an insert?"
  default: "N"
]	

integer: start  
[
    standard: "$(retain)"
    information: "Start point of inserted sequence"
    default: "1"
]

You don't need to prompt the user for sequnce type though, because
"sequence"
data items have attributes:

sequence: sequence 
[ 
  parameter: "Y" 
  type: protein 
] 

sequence.begin (start residue, i.e. -sbegin value) 
sequence.end (end residue, i.e. -send value) 
sequence.length (length)
sequence.protein (true if sequence is protein)
sequence.nucleic (true if sequence is nucleic)
sequence.name (name)
sequence.weight (alignment weight for a seqset)
sequence.count (no. of sequences in a seqset)

You access them in ACD by e.g. $(sequence.begin) etc.
e.g. to ensure your insert isn't past the end of the sequence use
maximum: $(sequence.end) 

> 
> 2. Segfaults
> The program segfaults when asked to make a nucleotide sequence with a given
> insert. This is caused by the inserts sequence type check. The stack trace
> is:
> 
> #0  0x40103531 in cvt_s () from /work/hena/emboss-2.9.0/lib/libajax.so.0
> #1  0x4010487a in ajFmtVfmt () from /work/hena/emboss-2.9.0/lib/libajax.so.0
> #2  0x4010445c in ajFmtVfmtStrCL ()
>    from /work/hena/emboss-2.9.0/lib/libajax.so.0
> #3  0x40104367 in ajFmtPrintS () from /work/hena/emboss-2.9.0/lib/libajax.so.0
> #4  0x40145998 in seqTypeCharDnaGap ()
>    from /work/hena/emboss-2.9.0/lib/libajax.so.0
> #5  0x40144e70 in ajSeqTypeDnaS ()
>    from /work/hena/emboss-2.9.0/lib/libajax.so.0
> #6  0x08049234 in main ()
> #7  0x40466a67 in __libc_start_main () from /lib/i686/libc.so.6
> 
> Protein typechecking works ok.
> 

If you really can't fix it get back in touch and I can run it through
Purify.

> 3. Uniformity
> This problem appears when making pure random sequence. I tried to use
> 'ajax/seqtype.c' lines
> 
> char seqCharProtPure[]  = "ACDEFGHIKLMNPQRSTVWYacdefghiklmnpqrstvwy";
> char seqCharNucPure[]   = "ACGTUacgtu";
> 
> with the following addeitions to the file
> 
> int  seqCharProtPureLength = 40;
> int  seqCharNucPureLength = 10;
> 
> Now, this did not work. Therefore, I just copied them within the program and
> it worked fine. However, I don't think this is the proper way to do, since
> the program doesn't then uses it's own settings for what is good character
> and what is not.

I'm presuming the 10 and 40 are size of your two arrays.  If you want
to treat them as strings you have to leave space for your terminating
NULL, so 41 and 11 would do it.  All abitrary limits really should be
avoided though, use e.g. 

AjPStr seqCharProtPure=NULL;
seqCharProtPure=ajStrNewC("ACDEFGHIKLMNPQRSTVWYacdefghiklmnpqrstvwy");

and ajStrChar to return a single character from a string at a given
position.

 Is there a way to use something more generic, so that if
> emboss changes these things, they would be applied to this program as well?
> 

There might (perhaps should!) be - Alan Bleasby
(ableasby at rfcgr.mrc.ac.uk) is 
the best man to ask about that.

> Any help is most appreciated. I would like to submit this when these things
> are fixed. 'makeseq.c' and 'makeseq.acd' are attached. Also basic help on
> creating help page for makeseq would be appreciated.

I've attached the template I use for the DOMAINATRIX documentation, e.g.
:
http://www.rfcgr.mrc.ac.uk/Software/EMBOSS/Apps/domainatrix/rocon.html
With this template, I document stuff by hand. The only external program
I use is "acdtable" to get the ACD stuff. This is slightly different
from 
the format used for EMBOSS apps though.

> 
> Thanks,
> --
> Henrikki Almusa
> 

Hope this helps and thanks for the interest

Cheers

Jon

>   ------------------------------------------------------------------------
>                   Name: makeseq.acd
>    makeseq.acd    Type: Plain Text (text/plain)
>               Encoding: 7bit
> 
>                 Name: makeseq.c
>    makeseq.c    Type: text/x-csrc
>             Encoding: 7bit

-- 
Jon C. Ison, PhD
Proteomics Applications Group
MRC Rosalind Franklin Centre for Genomics Research
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
Tel: +44 1223 494500  Fax: +44 1223 494512
E-mail: jison at rfcgr.mrc.ac.uk  Web: http://www.rfcgr.mrc.ac.uk
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.open-bio.org/pipermail/emboss-dev/attachments/20040924/5eef7754/attachment-0001.html>