[Biojava-l] Parsing exising gaps

Fri Nov 16 09:28:40 UTC 2007

Hi Richard,

  thanks for your super fast reply. I managed to recompile using CVS/ant and
the MSF import now works brilliantly and simply as follows:

  BufferedReader br = new BufferedReader(new
FileReader(aFileChooser.getSelectedFile()));
  SimpleAlignment align =
(SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

  // Iterate through sequences in turn
  Iterator aSequences = align.symbolListIterator();
  while(aSequences.hasNext()) {

      // Retrieve gapped sequence
      SimpleGappedSequence aGapped =
(SimpleGappedSequence)aSequences.next();

      ...do whatever with each gapped sequence
  }

  The returned gapped sequences are all properly set up with gaps, name etc.
But as for other users, I think there may be some problems, since the
SimpleAlignment object only has a general symbol list iterator, the user
will have to cast each statement extracting a sequence object, and

      SimpleSequence aSimple = (SimpleSequence)aSequences.next();

returns an ClassCastException at run time. So old code might not run with
the update as far as I can see.

  Ditlev 

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb
> -----Original Message-----
> From: Richard Holland [mailto:holland at ebi.ac.uk]
> Sent: 16 November 2007 10:00
> To: Ditlev Egeskov Brodersen
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] Parsing exising gaps
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi Ditlev.
> 
> After some investigation and some helpful hints from Mark, it turns out
> that there are methods in DNATools/ProteinTools that can construct
> proper GappedSymbolList objects out of strings.
> 
> I have managed to modify the MSF parser to use this instead. This means
> that the MSF parser will now return instances of GappedSymbolList
> (actually GappedSequences to be accurate) rather than SimpleSymbolList.
> Thanks to the way the APIs work this will make no difference to
> existing
> users (except those who are depending on being able to cast it to a
> certain type - which they shouldn't, because the API doesn't guarantee
> it to be of any type!), but it will fix it for you. Future releases
> will
> modify the API (or include a completely new MSF parser) which will
> explicitly return GappedSymbolLists in the API declarations rather than
> plain SymbolLists, but I can't do that right now because it would break
> existing users code.
> 
> To get the modified parser you will need to check out the very latest
> source code from our CVS repository and compile it using ant.
> Instructions are on our website at biojava.org if you have not done
> this
> before.
> 
> Hope this helps you.
> 
> cheers,
> Richard
> 
> 
> Ditlev Egeskov Brodersen wrote:
> > Hi Richard,
> >
> >   thanks for clarifying this and for your useful suggestion, which
> I've
> > managed to implement as shown below. It works nicely, but I was
> really
> > surprised to learn that biojava hasn't yet implemented a proper
> parsing of
> > gap characters from strings into the object structure as this seems
> central
> > to any use of pre-aligned sequences. Also, I find it problematic that
> the
> > API implements the gap characters as part of the alphabets. In my
> view, this
> > breaks the logic of the object model because proteins don't really
> have gaps
> > in their sequences.
> >
> >   Rather, the constructor of the Sequence-derived classes ought to
> throw an
> > exception when non-protein characters are passed and should not allow
> the
> > user to create an object with sequence elements that are non-
> standard.
> > Instead, I think there should be a static method that allows cleaning
> the
> > input sequence before passing it to the Sequence constructor. On the
> other
> > hand, the constructor of the GappedSequence-derived classes should
> recognise
> > the gaps and create an object with blocks of legal protein symbols
> and gaps
> > in the appropriate places.
> >
> >   -- Ditlev
> >
> >   // Read MSF file into Alignment object
> >   BufferedReader br = new BufferedReader(new
> > FileReader(aFileChooser.getSelectedFile()));
> >   SimpleAlignment align =
> > (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA,
> br);
> >
> >   // Iterate through sequences in turn
> >   Iterator aSequences = align.symbolListIterator();
> >   while(aSequences.hasNext()) {
> >
> >       // Retrieve SymbolList, the associated gap symbol and sequence
> string
> >       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
> >       Symbol aGapSymbol = aSym.getAlphabet().getGapSymbol();
> >       String aGappedString = aSym.seqString();
> >
> >       // Prepare non-gapped string
> >       String aPlainString = "";
> >
> >       // Loop through individual symbols and add non-gap characters
> to
> > string
> >       for(int i=1;i<=aSym.length();i++)
> >           if(aSym.symbolAt(i) != aGapSymbol)
> > aPlainString += aGappedString.charAt(i-1);
> >
> >       // Create a new gapped sequence object with the plain (non-
> gapped)
> > sequence
> >       SimpleGappedSequence aGapped =
> >
> (SimpleGappedSequence)ProteinTools.createGappedProteinSequence(aPlainSt
> ring,
> > "");
> >
> >       // Use separate indices for gapped and plain sequences
> >       int n = 1;
> >
> >       // Loop through individual gapped sequence symbols and insert
> gap into
> > object when gap symbol is encountered
> >       for(int i=1;i<=aSym.length();i++)
> >           if(aSym.symbolAt(i) != aGapSymbol)
> > n++;
> >           else
> > aGapped.addGapInSource(n);
> >
> > --
> >
> > Ditlev Egeskov Brodersen
> > Lektor
> > Bakkefaldet 30, Hasle
> > 8210 Århus V
> >
> > www.lindeman-brodersen.dk
> >
> >> -----Original Message-----
> >> From: Richard Holland [mailto:holland at ebi.ac.uk]
> >> Sent: 15 November 2007 14:52
> >> To: Ditlev Egeskov Brodersen
> >> Cc: biojava-l at biojava.org
> >> Subject: Re: [Biojava-l] Parsing exising gaps
> >>
> > I think you've uncovered a number of problems here:
> >
> > 1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
> > all the other predefined alphabets.
> >
> > 2. The MSF parser doesn't bother trying to build GappedSequence
> > instances, instead it just builds solid sequences with the gaps as
> > normal symbols.
> >
> > 3. There is no constructor or method for taking a sequence with
> > embedded
> > gap symbols and turning it into a GappedSequence with separate
> chunks.
> >
> > Combined, these three problems make it impossible to do what you want
> > easily. I will make a note to fix this on the plans for the next
> > BioJava
> > development cycle.
> >
> > In the meantime, your best bet would be to construct a second
> alignment
> > block by iterating over the alignment block you already have and
> > parsing
> > the locations of the gap symbols. You would create a
> > SimpleGappedSequence intially over the ungapped sequence, then use
> the
> > insert gap methods to insert the gaps into this ungapped sequence
> > before
> > putting all the SimpleGappedSequence objects together into a new
> > alignment.
> >
> > cheers,
> > Richard
> >
> > Ditlev Egeskov Brodersen wrote:
> >>>> Dear all,
> >>>>
> >>>>
> >>>>
> >>>> I have managed to read an MSF-formatted alignment from a file
> > selected
> >>>> through FileChooser as follows:
> >>>>
> >>>>
> >>>>
> >>>>   BufferedReader br = new BufferedReader(new
> >>>> FileReader(aFileChooser.getSelectedFile()));
> >>>>
> >>>>   SimpleAlignment align =
> >>>> (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA,
> > br);
> >>>>
> >>>>
> >>>> I can now retrieve the sequence names and sequences through the
> > Alignment
> >>>> object:
> >>>>
> >>>>
> >>>>
> >>>>   Iterator aLabels = align.getLabels().iterator();
> >>>>
> >>>>   Iterator aSequences = align.symbolListIterator();
> >>>>
> >>>>
> >>>>
> >>>> However, I now what to be able to translate between real sequence
> > numbers
> >>>> and the positions within each alignment string, i.e. retrieve
> > positions that
> >>>> remove the gaps first (gaps are represented by hyphens '-' in the
> MSF
> >>>> format). How can I tell BioJava to parse the gaps into an
> > GappedSequence
> >>>> format? I have tried the following to check what position 15 (past
> > the the
> >>>> first gap) translates into:
> >>>>
> >>>>
> >>>>
> >>>>   int n = 0;
> >>>>
> >>>>   while(aSequences.hasNext()) {
> >>>>
> >>>>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
> >>>>
> >>>>       SimpleGappedSequence aGapped = new SimpleGappedSequence(new
> >>>> SimpleSequence(aSym, "", aLabels.next().toString(), null));
> >>>>
> >>>>       System.out.println(aGapped.gappedToLocation(new
> > PointLocation(15)));
> >>>>   }
> >>>>
> >>>>
> >>>>
> >>>> But I only get 15 back out. I have also studied the constructor of
> > the
> >>>> underlying SimpleGappedSymbolList but it simply copies the
> SymbolList
> > and
> >>>> creates one big block:
> >>>>
> >>>>
> >>>>
> >>>>   public SimpleGappedSymbolList(SymbolList source) {
> >>>>
> >>>>     this.source = source;
> >>>>
> >>>>     this.alpha = source.getAlphabet();
> >>>>
> >>>>     this.blocks = new ArrayList();
> >>>>
> >>>>     this.length = source.length();
> >>>>
> >>>>     Block b = new Block(1, length, 1, length);
> >>>>
> >>>>     blocks.add(b);
> >>>>
> >>>>   }
> >>>>
> >>>>
> >>>>
> >>>> Is there a way to tell SimpleGappedSequence to parse itself in
> terms
> > of the
> >>>> gap characters in the sequence string? How is the sequence
> > represented in
> >>>> this case, if not by gaps? Surely the hyphen cannot be a part of
> the
> >>>> standard PROTEIN-TERM alphabet, yet I get no complaints for the
> use
> > of it?
> >>>>
> >>>>
> >>>> Best wishes,
> >>>>
> >>>>
> >>>>
> >>>>   Ditlev
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>>
> >>>>
> >>>> Ditlev E. Brodersen, Ph.D.
> >>>> Lektor, Associate Professor
> >>>>
> >>>>
> >>>>
> >>>> Department of Molecular Biology   Office:  +45 89425259
> >>>> University of AarhusLab:     +45 89425022
> >>>> Gustav Wieds Vej 10cFax:     +45 86123178
> >>>> DK-8000 Aarhus C    Email:    <mailto:deb at mb.au.dk> deb at mb.au.dk
> >>>> Denmark             Lab WWW:  <http://bioxray.dk/~deb>
> > www.bioxray.dk/~deb
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHPVv84C5LeMEKA/QRAn0cAJ9jJUaA3bjiEwlzxaAo/bsN5+CT1QCcCLxS
> Rv73CVmtYpEz+apJwM1L3sA=
> =UPU6
> -----END PGP SIGNATURE-----