[Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?

Richard Holland holland at eaglegenomics.com
Fri Nov 6 17:15:28 UTC 2009


Ah OK I see what's going on.

The convenience method you're using, RichSequence.IOTools.readStream(),
uses FastaFormat to try to guess which alphabet to use based on the
first line of the input sequence.

In FastaFormat, it does this by searching for matching non-DNA  
symbols. The search is case-sensitive:

	protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");

FastaFormat needs patching to make this pattern case-insensitive.
Even then, if none of the above symbols appear until the second or a
later line of the sequence, the guess will fail: it'll assume DNA
and give you the same error as before. When you know in advance which
alphabet the sequence uses, it's best to avoid the guessing algorithms
and instead use a method that explicitly specifies the alphabet, such
as readFastaDNA (or readFastaProtein, for a file like yours).
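
To illustrate the case-sensitivity point, here is a minimal standalone
sketch that uses only java.util.regex. The pattern string is the one
quoted above; the class name, the helper method and the use of
matches() are my assumptions for illustration, not the actual
FastaFormat code, and the CASE_INSENSITIVE variant is just one way a
patch could look.

```java
import java.util.regex.Pattern;

final class AminoAcidPatternDemo {

    // Pattern string as quoted from FastaFormat: case-sensitive.
    static final Pattern CASE_SENSITIVE = Pattern.compile(".*[FLIPQE].*");

    // Hypothetical patched variant: same expression, compiled case-insensitively.
    static final Pattern CASE_INSENSITIVE =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    // True if the given line contains at least one of the non-DNA symbols.
    static boolean looksLikeProtein(Pattern pattern, String line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        String lowercaseProtein = "mngtegpnfyvpfsnktgvv";

        // Lowercase residues are missed by the case-sensitive pattern.
        System.out.println(looksLikeProtein(CASE_SENSITIVE, lowercaseProtein));

        // The case-insensitive variant picks up 'e', 'p', 'f' and so on.
        System.out.println(looksLikeProtein(CASE_INSENSITIVE, lowercaseProtein));

        // A first line with none of F, L, I, P, Q, E still looks like DNA,
        // even if it is really protein.
        System.out.println(looksLikeProtein(CASE_INSENSITIVE, "MAAAGGGSSS"));
    }
}
```

The last call shows why making the pattern case-insensitive is not a
complete fix: a guess based on one line can still go wrong.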

However, there's still one thing you definitely can't do, and that's
parse different types of sequence from the same input without adding
some code of your own to detect which alphabet each individual
sequence uses before handing it to the appropriate BioJava parser.
Your code appears to be expecting mixed input, but this won't work
unless all the sequences happen to use the same alphabet.
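
A rough sketch of what such detection code might look like, in plain
Java with no BioJava calls. Everything here is hypothetical: the class
and method names are made up, the input is taken as a String for
brevity, and the heuristic (any of F, L, I, P, Q, E anywhere in the
record means protein, otherwise U means RNA, otherwise DNA) is only an
assumption that will misclassify some sequences. Unlike the guess in
FastaFormat, it looks at the whole record rather than the first line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FastaAlphabetSketch {

    /** One FASTA record with a guessed alphabet name ("DNA", "RNA" or "PROTEIN"). */
    static final class Record {
        final String header;
        final String residues;
        final String alphabet;

        Record(String header, String residues) {
            this.header = header;
            this.residues = residues;
            this.alphabet = guessAlphabet(residues);
        }
    }

    // Heuristic guess over the whole sequence, ignoring case.
    static String guessAlphabet(String residues) {
        String upper = residues.toUpperCase(Locale.ROOT);
        boolean sawU = false;
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if ("FLIPQE".indexOf(c) >= 0) {
                return "PROTEIN";
            }
            if (c == 'U') {
                sawU = true;
            }
        }
        return sawU ? "RNA" : "DNA";
    }

    // Splits FASTA text into records; sequence lines are concatenated.
    static List<Record> split(String fastaText) {
        List<Record> records = new ArrayList<Record>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String raw : fastaText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    records.add(new Record(header, residues.toString()));
                }
                header = line.substring(1).trim();
                residues.setLength(0);
            } else if (header != null) {
                residues.append(line);
            }
        }
        if (header != null) {
            records.add(new Record(header, residues.toString()));
        }
        return records;
    }

    public static void main(String[] args) {
        String input = ">dna1\nacgtacgt\n>prot1\nmngt\negpn\n";
        for (Record r : split(input)) {
            System.out.println(r.header + " -> " + r.alphabet);
        }
    }
}
```

Each record could then be passed on its own to the alphabet-specific
reader that matches the guess.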

cheers,
Richard

On 6 Nov 2009, at 16:54, Carl Mäsak wrote:

> Richard (>), Carl (>>):
>>> I'm using RichSequenceIterator to read FASTA files containing
>>> proteins. Somehow it doesn't work when the protein sequences are in
>>> lowercase, which they sometimes are when downloaded from e.g.
>>> Uniprot.
>>> My code fails to recognize the following file as containing a
>>> protein sequence:
>>>
>>> >OPSD_FELCA
>>> mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyilln
>>> lavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgv
>>> aftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaq
>>> qqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrn
>>> cmlttlccgknplgddeasttgsktetsqvapa
>>>
>>> What am I missing? Here's the code I'm using to read in sequences:
>>>
>>>  private List<ISequence> sequencesFromInputStream(InputStream stream) {
>>>
>>>      BufferedInputStream bufferedStream = new BufferedInputStream(stream);
>>>      Namespace ns = RichObjectFactory.getDefaultNamespace();
>>>      RichSequenceIterator seqit = null;
>>>
>>>      try {
>>>          seqit = RichSequence.IOTools.readStream(bufferedStream, ns);
>>>      } catch (IOException e) {
>>>          logger.error("Couldn't read sequences from file", e);
>>>          return Collections.emptyList();
>>>      }
>>>
>>>      List<ISequence> sequences = new ArrayList<ISequence>();
>>>      try {
>>>          while ( seqit.hasNext() ) {
>>>              RichSequence rseq;
>>>              rseq = seqit.nextRichSequence(); // *error occurs here*
>>>              if (rseq == null)
>>>                  continue;
>>>              String alphabet = rseq.getAlphabet().getName();
>>>              sequences.add(
>>>                    "DNA".equals(alphabet) ? new BiojavaDNA(rseq)
>>>                  : "RNA".equals(alphabet) ? new BiojavaRNA(rseq)
>>>                  :                          new BiojavaProtein(rseq) );
>>>          }
>>>      } catch (NoSuchElementException e) {
>>>          logger.error("Read past last sequence", e);
>>>      } catch (BioException e) {
>>>          logger.error(e); // *ends up here*
>>>      }
>>>
>>>      return sequences;
>>>  }
>>>
>>> Grateful for any pointers you might have.
>>
>> Could you post the output from the exception stack that it generates?
>
> org.biojava.bio.BioException: Could not read sequence
> 	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence 
> (RichStreamReader.java:113)
> 	at  
> net.bioclipse.biojava.business.BiojavaManager.sequencesFromInputStream 
> (BiojavaManager.java:314)
> 	at net.bioclipse.biojava.business.BiojavaManager.sequencesFromFile 
> (BiojavaManager.java:291)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at  
> net.bioclipse.managers.business.AbstractManagerMethodDispatcher.doInvoke 
> (AbstractManagerMethodDispatcher.java:243)
> 	at  
> net.bioclipse.managers.business.JavaManagerMethodDispatcher.doInvokeInSameThread 
> (JavaManagerMethodDispatcher.java:248)
> 	at  
> net.bioclipse.managers.business.AbstractManagerMethodDispatcher.invoke 
> (AbstractManagerMethodDispatcher.java:130)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at net.bioclipse.recording.WrapInProxyAdvice.invoke 
> (WrapInProxyAdvice.java:22)
> 	at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at  
> org.springframework.osgi.service.importer.internal.aop.ServiceInvoker.doInvoke 
> (ServiceInvoker.java:59)
> 	at  
> org.springframework.osgi.service.importer.internal.aop.ServiceInvoker.invoke 
> (ServiceInvoker.java:67)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at  
> org.springframework.osgi.service.importer.internal.aop.ServiceTCCLInterceptor.invoke 
> (ServiceTCCLInterceptor.java:34)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at  
> org.springframework.osgi.service.importer.support.LocalBundleContextAdvice.invoke 
> (LocalBundleContextAdvice.java:59)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at  
> org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed 
> (DelegatingIntroductionInterceptor.java:131)
> 	at  
> org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke 
> (DelegatingIntroductionInterceptor.java:119)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke 
> (JdkDynamicAopProxy.java:204)
> 	at $Proxy18.invoke(Unknown Source)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at  
> org.springframework.aop.framework.adapter.AfterReturningAdviceInterceptor.invoke 
> (AfterReturningAdviceInterceptor.java:50)
> 	at  
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed 
> (ReflectiveMethodInvocation.java:171)
> 	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke 
> (JdkDynamicAopProxy.java:204)
> 	at $Proxy20.sequencesFromFile(Unknown Source)
> 	at net.bioclipse.biojava.ui.editors.Aligner.setInput(Aligner.java: 
> 152)
> 	at net.bioclipse.biojava.ui.editors.Aligner.init(Aligner.java:138)
> 	at org.eclipse.ui.part.MultiPageEditorPart.addPage 
> (MultiPageEditorPart.java:238)
> 	at org.eclipse.ui.part.MultiPageEditorPart.addPage 
> (MultiPageEditorPart.java:212)
> 	at net.bioclipse.biojava.ui.editors.SequenceEditor.createPages 
> (SequenceEditor.java:47)
> 	at org.eclipse.ui.part.MultiPageEditorPart.createPartControl 
> (MultiPageEditorPart.java:357)
> 	at org.eclipse.ui.internal.EditorReference.createPartHelper 
> (EditorReference.java:662)
> 	at org.eclipse.ui.internal.EditorReference.createPart 
> (EditorReference.java:462)
> 	at org.eclipse.ui.internal.WorkbenchPartReference.getPart 
> (WorkbenchPartReference.java:595)
> 	at org.eclipse.ui.internal.PartPane.setVisible(PartPane.java:313)
> 	at org.eclipse.ui.internal.presentations.PresentablePart.setVisible 
> (PresentablePart.java:180)
> 	at  
> org.eclipse.ui.internal.presentations.util.PresentablePartFolder.select 
> (PresentablePartFolder.java:270)
> 	at  
> org.eclipse.ui.internal.presentations.util.LeftToRightTabOrder.select 
> (LeftToRightTabOrder.java:65)
> 	at  
> org.eclipse.ui.internal.presentations.util.TabbedStackPresentation.selectPart 
> (TabbedStackPresentation.java:473)
> 	at org.eclipse.ui.internal.PartStack.refreshPresentationSelection 
> (PartStack.java:1256)
> 	at org.eclipse.ui.internal.PartStack.setSelection(PartStack.java: 
> 1209)
> 	at org.eclipse.ui.internal.PartStack.showPart(PartStack.java:1608)
> 	at org.eclipse.ui.internal.PartStack.add(PartStack.java:499)
> 	at org.eclipse.ui.internal.EditorStack.add(EditorStack.java:103)
> 	at org.eclipse.ui.internal.PartStack.add(PartStack.java:485)
> 	at org.eclipse.ui.internal.EditorStack.add(EditorStack.java:112)
> 	at org.eclipse.ui.internal.EditorSashContainer.addEditor 
> (EditorSashContainer.java:63)
> 	at org.eclipse.ui.internal.EditorAreaHelper.addToLayout 
> (EditorAreaHelper.java:225)
> 	at org.eclipse.ui.internal.EditorAreaHelper.addEditor 
> (EditorAreaHelper.java:213)
> 	at org.eclipse.ui.internal.EditorManager.createEditorTab 
> (EditorManager.java:778)
> 	at org.eclipse.ui.internal.EditorManager.openEditorFromDescriptor 
> (EditorManager.java:677)
> 	at org.eclipse.ui.internal.EditorManager.openEditor 
> (EditorManager.java:638)
> 	at org.eclipse.ui.internal.WorkbenchPage.busyOpenEditorBatched 
> (WorkbenchPage.java:2854)
> 	at org.eclipse.ui.internal.WorkbenchPage.busyOpenEditor 
> (WorkbenchPage.java:2762)
> 	at org.eclipse.ui.internal.WorkbenchPage.access$11 
> (WorkbenchPage.java:2754)
> 	at org.eclipse.ui.internal.WorkbenchPage$10.run(WorkbenchPage.java: 
> 2705)
> 	at org.eclipse.swt.custom.BusyIndicator.showWhile 
> (BusyIndicator.java:70)
> 	at org.eclipse.ui.internal.WorkbenchPage.openEditor 
> (WorkbenchPage.java:2701)
> 	at org.eclipse.ui.internal.WorkbenchPage.openEditor 
> (WorkbenchPage.java:2685)
> 	at org.eclipse.ui.internal.WorkbenchPage.openEditor 
> (WorkbenchPage.java:2676)
> 	at org.eclipse.ui.ide.IDE.openEditor(IDE.java:651)
> 	at org.eclipse.ui.ide.IDE.openEditor(IDE.java:610)
> 	at org.eclipse.ui.actions.OpenFileAction.openFile 
> (OpenFileAction.java:99)
> 	at org.eclipse.ui.actions.OpenSystemEditorAction.run 
> (OpenSystemEditorAction.java:99)
> 	at org.eclipse.ui.actions.RetargetAction.run(RetargetAction.java:221)
> 	at org.eclipse.ui.navigator.CommonNavigatorManager$3.open 
> (CommonNavigatorManager.java:202)
> 	at org.eclipse.ui.OpenAndLinkWithEditorHelper$InternalListener.open 
> (OpenAndLinkWithEditorHelper.java:48)
> 	at org.eclipse.jface.viewers.StructuredViewer$2.run 
> (StructuredViewer.java:842)
> 	at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
> 	at org.eclipse.core.runtime.Platform.run(Platform.java:888)
> 	at org.eclipse.ui.internal.JFaceUtil$1.run(JFaceUtil.java:48)
> 	at org.eclipse.jface.util.SafeRunnable.run(SafeRunnable.java:175)
> 	at org.eclipse.jface.viewers.StructuredViewer.fireOpen 
> (StructuredViewer.java:840)
> 	at org.eclipse.jface.viewers.StructuredViewer.handleOpen 
> (StructuredViewer.java:1101)
> 	at org.eclipse.ui.navigator.CommonViewer.handleOpen 
> (CommonViewer.java:467)
> 	at org.eclipse.jface.viewers.StructuredViewer$6.handleOpen 
> (StructuredViewer.java:1205)
> 	at org.eclipse.jface.util.OpenStrategy.fireOpenEvent 
> (OpenStrategy.java:264)
> 	at org.eclipse.jface.util.OpenStrategy.access$2(OpenStrategy.java: 
> 258)
> 	at org.eclipse.jface.util.OpenStrategy$1.handleEvent 
> (OpenStrategy.java:298)
> 	at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
> 	at org.eclipse.swt.widgets.Display.sendEvent(Display.java:3543)
> 	at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1250)
> 	at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1273)
> 	at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1258)
> 	at org.eclipse.swt.widgets.Widget.notifyListeners(Widget.java:1079)
> 	at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java: 
> 3441)
> 	at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3100)
> 	at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java: 
> 2405)
> 	at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2369)
> 	at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2221)
> 	at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:500)
> 	at org.eclipse.core.databinding.observable.Realm.runWithDefault 
> (Realm.java:332)
> 	at org.eclipse.ui.internal.Workbench.createAndRunWorkbench 
> (Workbench.java:493)
> 	at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java: 
> 149)
> 	at net.bioclipse.ui.Application.start(Application.java:36)
> 	at org.eclipse.equinox.internal.app.EclipseAppHandle.run 
> (EclipseAppHandle.java:194)
> 	at  
> org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication 
> (EclipseAppLauncher.java:110)
> 	at  
> org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start 
> (EclipseAppLauncher.java:79)
> 	at org.eclipse.core.runtime.adaptor.EclipseStarter.run 
> (EclipseStarter.java:368)
> 	at org.eclipse.core.runtime.adaptor.EclipseStarter.run 
> (EclipseStarter.java:179)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:559)
> 	at org.eclipse.equinox.launcher.Main.basicRun(Main.java:514)
> 	at org.eclipse.equinox.launcher.Main.run(Main.java:1311)
> 	at org.eclipse.equinox.launcher.Main.main(Main.java:1287)
> Caused by: org.biojava.bio.seq.io.ParseException:
>
> A Exception Has Occurred During Parsing.
> Please submit the details that follow to biojava-l at biojava.org or post
> a bug report to http://bugzilla.open-bio.org/
>
> Format_object=org.biojavax.bio.seq.io.FastaFormat
> Accession=OPSD_FELCA
> Id=null
> Comments=problem parsing symbols
> Parse_block=mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyillnlavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgvaftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaqqqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrncmlttlccgknplgddeasttgsktetsqvapa
> Stack trace follows ....
>
>
> 	at org.biojavax.bio.seq.io.FastaFormat.readRichSequence 
> (FastaFormat.java:244)
> 	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence 
> (RichStreamReader.java:110)
> 	... 114 more
> Caused by: org.biojava.bio.symbol.IllegalSymbolException: This
> tokenization doesn't contain character: 'e'
> 	at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar 
> (CharacterTokenization.java:175)
> 	at org.biojava.bio.seq.io.CharacterTokenization 
> $TPStreamParser.characters(CharacterTokenization.java:246)
> 	at org.biojava.bio.symbol.SimpleSymbolList.<init> 
> (SimpleSymbolList.java:178)
> 	at org.biojavax.bio.seq.io.FastaFormat.readRichSequence 
> (FastaFormat.java:237)
> 	... 115 more
>
> // Carl

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/




