[Biojava-l] Parsing existing gaps

Thu Nov 15 13:51:48 UTC 2007

I think you've uncovered a number of problems here:

1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
all the other predefined alphabets.

2. The MSF parser doesn't bother trying to build GappedSequence
instances; instead it just builds solid sequences with the gaps as
normal symbols.

3. There is no constructor or method for taking a sequence with embedded
gap symbols and turning it into a GappedSequence with separate chunks.

Combined, these three problems make it impossible to do what you want
easily. I will make a note to fix this on the plans for the next BioJava
development cycle.

In the meantime, your best bet would be to construct a second alignment
block by iterating over the alignment block you already have and parsing
the locations of the gap symbols. You would create a
SimpleGappedSequence initially over the ungapped sequence, then use the
gap-insertion methods to insert the gaps into it before putting all the
SimpleGappedSequence objects together into a new alignment.
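
The gap-scanning part of that workaround can be sketched in plain Java,
without any BioJava types. This is only an illustration under some
assumptions: the class GapScan and its methods ungapped, gapRuns and
viewToSource are hypothetical names invented here, the aligned row is
taken as a String, and '-' is taken as the gap character. Positions are
1-based, to match BioJava coordinates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava: scans one aligned row for gaps.
public class GapScan {

    /** Returns the row with every gap character removed. */
    public static String ungapped(String row, char gap) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != gap) {
                sb.append(row.charAt(i));
            }
        }
        return sb.toString();
    }

    /**
     * Returns each run of gaps as {start, length}, left to right.
     * start is the 1-based alignment column where the run begins.
     */
    public static List<int[]> gapRuns(String row, char gap) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < row.length()) {
            if (row.charAt(i) == gap) {
                int start = i;
                while (i < row.length() && row.charAt(i) == gap) {
                    i++;
                }
                runs.add(new int[] { start + 1, i - start });
            } else {
                i++;
            }
        }
        return runs;
    }

    /**
     * Maps a 1-based alignment column to the 1-based position in the
     * ungapped sequence, or returns -1 if the column holds a gap.
     */
    public static int viewToSource(String row, char gap, int column) {
        if (column < 1 || column > row.length()) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        if (row.charAt(column - 1) == gap) {
            return -1;
        }
        int residues = 0;
        for (int i = 0; i < column; i++) {
            if (row.charAt(i) != gap) {
                residues++;
            }
        }
        return residues;
    }

    public static void main(String[] args) {
        String row = "MKV--LAT-G"; // made-up example row
        System.out.println(ungapped(row, '-'));
        for (int[] run : gapRuns(row, '-')) {
            System.out.println("gap run at column " + run[0] + ", length " + run[1]);
        }
        System.out.println(viewToSource(row, '-', 6));
    }
}
```

Because the runs are reported left to right in alignment coordinates,
inserting them in that order into a gapped view built over the ungapped
sequence reproduces the original row. In BioJava that would mean one
gap-insertion call per run on the SimpleGappedSequence (I believe
addGapsInView(pos, length) is the relevant method, but please check the
Javadoc for your version).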

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Dear all,
> 
>  
> 
> I have managed to read an MSF-formatted alignment from a file selected
> through FileChooser as follows:
>
>   BufferedReader br = new BufferedReader(
>       new FileReader(aFileChooser.getSelectedFile()));
>   SimpleAlignment align = (SimpleAlignment)
>       SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
>
> 
> I can now retrieve the sequence names and sequences through the Alignment
> object:
>
>   Iterator aLabels = align.getLabels().iterator();
>   Iterator aSequences = align.symbolListIterator();
>
> However, I now want to be able to translate between real sequence numbers
> and the positions within each alignment string, i.e. retrieve positions that
> remove the gaps first (gaps are represented by hyphens '-' in the MSF
> format). How can I tell BioJava to parse the gaps into a GappedSequence
> format? I have tried the following to check what position 15 (past the
> first gap) translates into:
>
>   int n = 0;
>   while(aSequences.hasNext()) {
>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>       SimpleGappedSequence aGapped = new SimpleGappedSequence(
>           new SimpleSequence(aSym, "", aLabels.next().toString(), null));
>       System.out.println(aGapped.gappedToLocation(new PointLocation(15)));
>   }
>
> But I only get 15 back out. I have also studied the constructor of the
> underlying SimpleGappedSymbolList but it simply copies the SymbolList and
> creates one big block:
>
>   public SimpleGappedSymbolList(SymbolList source) {
>     this.source = source;
>     this.alpha = source.getAlphabet();
>     this.blocks = new ArrayList();
>     this.length = source.length();
>     Block b = new Block(1, length, 1, length);
>     blocks.add(b);
>   }
>
> Is there a way to tell SimpleGappedSequence to parse itself in terms of the
> gap characters in the sequence string? How is the sequence represented in
> this case, if not by gaps? Surely the hyphen cannot be a part of the
> standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?
> 
>  
> 
> Best wishes,
> 
>  
> 
>   Ditlev
> 
>  
> 
> --
> 
>  
> 
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
> 
>  
> 
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb