[Biojava-l] Parsing exising gaps

Fri Nov 16 08:59:41 UTC 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Ditlev.

After some investigation and some helpful hints from Mark, it turns out
that there are methods in DNATools/ProteinTools that can construct
proper GappedSymbolList objects out of strings.

I have managed to modify the MSF parser to use this instead. This means
that the MSF parser will now return instances of GappedSymbolList
(actually GappedSequences to be accurate) rather than SimpleSymbolList.
Thanks to the way the APIs work this will make no difference to existing
users (except those who are depending on being able to cast it to a
certain type - which they shouldn't, because the API doesn't guarantee
it to be of any type!), but it will fix it for you. Future releases will
modify the API (or include a completely new MSF parser) which will
explicitly return GappedSymbolLists in the API declarations rather than
plain SymbolLists, but I can't do that right now because it would break
existing users code.

To get the modified parser you will need to check out the very latest
source code from our CVS repository and compile it using ant.
Instructions are on our website at biojava.org if you have not done this
before.

Hope this helps you.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi Richard,
> 
>   thanks for clarifying this and for your useful suggestion, which I've
> managed to implement as shown below. It works nicely, but I was really
> surprised to learn that biojava hasn't yet implemented a proper parsing of
> gap characters from strings into the object structure as this seems central
> to any use of pre-aligned sequences. Also, I find it problematic that the
> API implements the gap characters as part of the alphabets. In my view, this
> breaks the logic of the object model because proteins don't really have gaps
> in their sequences.
> 
>   Rather, the constructor of the Sequence-derived classes ought to throw an
> exception when non-protein characters are passed and should not allow the
> user to create an object with sequence elements that are non-standard.
> Instead, I think there should be a static method that allows cleaning the
> input sequence before passing it to the Sequence constructor. On the other
> hand, the constructor of the GappedSequence-derived classes should recognise
> the gaps and create an object with blocks of legal protein symbols and gaps
> in the appropriate places.
> 
>   -- Ditlev
> 
>   // Read MSF file into Alignment object
>   BufferedReader br = new BufferedReader(new
> FileReader(aFileChooser.getSelectedFile()));
>   SimpleAlignment align =
> (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
> 
>   // Iterate through sequences in turn
>   Iterator aSequences = align.symbolListIterator();
>   while(aSequences.hasNext()) {
> 
>       // Retrieve SymbolList, the associated gap symbol and sequence string
>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>       Symbol aGapSymbol = aSym.getAlphabet().getGapSymbol();
>       String aGappedString = aSym.seqString();
> 
>       // Prepare non-gapped string
>       String aPlainString = "";
> 
>       // Loop through individual symbols and add non-gap characters to
> string
>       for(int i=1;i<=aSym.length();i++)
>           if(aSym.symbolAt(i) != aGapSymbol)
>               aPlainString += aGappedString.charAt(i-1);
> 
>       // Create a new gapped sequence object with the plain (non-gapped)
> sequence
>       SimpleGappedSequence aGapped =
> (SimpleGappedSequence)ProteinTools.createGappedProteinSequence(aPlainString,
> "");
> 
>       // Use separate indices for gapped and plain sequences
>       int n = 1;
> 
>       // Loop through individual gapped sequence symbols and insert gap into
> object when gap symbol is encountered
>       for(int i=1;i<=aSym.length();i++)
>           if(aSym.symbolAt(i) != aGapSymbol)
>               n++;
>           else
>               aGapped.addGapInSource(n); 
> 
> --
>  
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>  
> www.lindeman-brodersen.dk
> 
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 15 November 2007 14:52
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: [Biojava-l] Parsing exising gaps
>>
> I think you've uncovered a number of problems here:
> 
> 1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
> all the other predefined alphabets.
> 
> 2. The MSF parser doesn't bother trying to build GappedSequence
> instances, instead it just builds solid sequences with the gaps as
> normal symbols.
> 
> 3. There is no constructor or method for taking a sequence with
> embedded
> gap symbols and turning it into a GappedSequence with separate chunks.
> 
> Combined, these three problems make it impossible to do what you want
> easily. I will make a note to fix this on the plans for the next
> BioJava
> development cycle.
> 
> In the meantime, your best bet would be to construct a second alignment
> block by iterating over the alignment block you already have and
> parsing
> the locations of the gap symbols. You would create a
> SimpleGappedSequence intially over the ungapped sequence, then use the
> insert gap methods to insert the gaps into this ungapped sequence
> before
> putting all the SimpleGappedSequence objects together into a new
> alignment.
> 
> cheers,
> Richard
> 
> Ditlev Egeskov Brodersen wrote:
>>>> Dear all,
>>>>
>>>>
>>>>
>>>> I have managed to read an MSF-formatted alignment from a file
> selected
>>>> through FileChooser as follows:
>>>>
>>>>
>>>>
>>>>   BufferedReader br = new BufferedReader(new
>>>> FileReader(aFileChooser.getSelectedFile()));
>>>>
>>>>   SimpleAlignment align =
>>>> (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA,
> br);
>>>>
>>>>
>>>> I can now retrieve the sequence names and sequences through the
> Alignment
>>>> object:
>>>>
>>>>
>>>>
>>>>   Iterator aLabels = align.getLabels().iterator();
>>>>
>>>>   Iterator aSequences = align.symbolListIterator();
>>>>
>>>>
>>>>
>>>> However, I now what to be able to translate between real sequence
> numbers
>>>> and the positions within each alignment string, i.e. retrieve
> positions that
>>>> remove the gaps first (gaps are represented by hyphens '-' in the MSF
>>>> format). How can I tell BioJava to parse the gaps into an
> GappedSequence
>>>> format? I have tried the following to check what position 15 (past
> the the
>>>> first gap) translates into:
>>>>
>>>>
>>>>
>>>>   int n = 0;
>>>>
>>>>   while(aSequences.hasNext()) {
>>>>
>>>>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>>>>
>>>>       SimpleGappedSequence aGapped = new SimpleGappedSequence(new
>>>> SimpleSequence(aSym, "", aLabels.next().toString(), null));
>>>>
>>>>       System.out.println(aGapped.gappedToLocation(new
> PointLocation(15)));
>>>>   }
>>>>
>>>>
>>>>
>>>> But I only get 15 back out. I have also studied the constructor of
> the
>>>> underlying SimpleGappedSymbolList but it simply copies the SymbolList
> and
>>>> creates one big block:
>>>>
>>>>
>>>>
>>>>   public SimpleGappedSymbolList(SymbolList source) {
>>>>
>>>>     this.source = source;
>>>>
>>>>     this.alpha = source.getAlphabet();
>>>>
>>>>     this.blocks = new ArrayList();
>>>>
>>>>     this.length = source.length();
>>>>
>>>>     Block b = new Block(1, length, 1, length);
>>>>
>>>>     blocks.add(b);
>>>>
>>>>   }
>>>>
>>>>
>>>>
>>>> Is there a way to tell SimpleGappedSequence to parse itself in terms
> of the
>>>> gap characters in the sequence string? How is the sequence
> represented in
>>>> this case, if not by gaps? Surely the hyphen cannot be a part of the
>>>> standard PROTEIN-TERM alphabet, yet I get no complaints for the use
> of it?
>>>>
>>>>
>>>> Best wishes,
>>>>
>>>>
>>>>
>>>>   Ditlev
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>> Ditlev E. Brodersen, Ph.D.
>>>> Lektor, Associate Professor
>>>>
>>>>
>>>>
>>>> Department of Molecular Biology   Office:  +45 89425259
>>>> University of AarhusLab:     +45 89425022
>>>> Gustav Wieds Vej 10cFax:     +45 86123178
>>>> DK-8000 Aarhus C    Email:    <mailto:deb at mb.au.dk> deb at mb.au.dk
>>>> Denmark             Lab WWW:  <http://bioxray.dk/~deb>
> www.bioxray.dk/~deb
>>>>
>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPVv84C5LeMEKA/QRAn0cAJ9jJUaA3bjiEwlzxaAo/bsN5+CT1QCcCLxS
Rv73CVmtYpEz+apJwM1L3sA=
=UPU6
-----END PGP SIGNATURE-----