[Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?
Richard Holland
holland at eaglegenomics.com
Fri Nov 6 17:15:28 UTC 2009
Ah OK I see what's going on.
The convenience method you're using, RichSequence.IOTools.readStream(),
uses FastaFormat to try and guess the alphabet to use based on the
first line of the input sequence.
In FastaFormat, it does this by searching for matching non-DNA
symbols. The search is case-sensitive:
protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");
FastaFormat needs patching to make this pattern case-insensitive.
Even with that fix, if none of the above symbols appears on the first
line of the sequence, the guessing will still fail: it'll assume it's
DNA and give you the same error as before.
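To illustrate both failure modes, here is a minimal sketch that uses only java.util.regex and the pattern quoted above. The class and method names are made up for the example, and I'm assuming the check is done with matches() on the first sequence line; the real FastaFormat code may differ in detail.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of BioJava.
final class CaseSensitivityDemo {
    // The pattern as quoted from FastaFormat: uppercase letters only.
    static final Pattern CASE_SENSITIVE = Pattern.compile(".*[FLIPQE].*");

    // Assumed shape of a patched pattern: same character class, case ignored.
    static final Pattern CASE_INSENSITIVE =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    // True if the whole line matches, i.e. it contains one of F, L, I, P, Q, E.
    static boolean looksLikeProtein(Pattern pattern, String firstLine) {
        return pattern.matcher(firstLine).matches();
    }

    public static void main(String[] args) {
        String lowercase = "mngtegpnfyvpfsnktgvv"; // start of OPSD_FELCA
        String noMarkers = "mktaggsw";             // protein-like, but no FLIPQE
        System.out.println(looksLikeProtein(CASE_SENSITIVE, lowercase));
        System.out.println(looksLikeProtein(CASE_INSENSITIVE, lowercase));
        System.out.println(looksLikeProtein(CASE_INSENSITIVE, noMarkers));
    }
}
```

The last call shows why a case-insensitive pattern alone is not a complete fix: a first line with none of the six marker letters is still not recognised as protein.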
In circumstances where you know in advance what alphabet the sequence
is in, it's best to avoid the guessing algorithms and instead use
methods such as readFastaDNA that explicitly specify the alphabet
you want to read.
However, there's still one thing that you definitely can't do and
that's parse different types of sequence from the same input without
inserting some kind of additional code to detect what alphabet each
individual sequence is using before parsing it using the appropriate
BioJava parser. Your code appears to be expecting mixed input, but this
won't work unless all the sequences happen to use the same alphabet.
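As a rough idea of what that additional detection code could look like, here is a sketch in plain Java with no BioJava calls. FastaAlphabetSniffer, Guess, guess() and classify() are hypothetical names invented for this example. The heuristic is an assumption of mine: it looks at the whole sequence rather than just the first line, ignores case, and treats anything outside A, C, G, T, U, N and the gap character as protein.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
final class FastaAlphabetSniffer {
    enum Guess { DNA, RNA, PROTEIN }

    // Classify one record from its complete residue string, ignoring case.
    static Guess guess(String residues) {
        String upper = residues.toUpperCase(Locale.ROOT);
        boolean sawU = false;
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if ("ACGTUN-".indexOf(c) < 0) {
                return Guess.PROTEIN; // a letter no nucleotide alphabet here uses
            }
            if (c == 'U') {
                sawU = true;
            }
        }
        return sawU ? Guess.RNA : Guess.DNA;
    }

    // Split FASTA text into records and map each header to its guess.
    static Map<String, Guess> classify(String fastaText) {
        Map<String, Guess> result = new LinkedHashMap<>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String raw : fastaText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    result.put(header, guess(residues.toString()));
                }
                header = line.substring(1).trim();
                residues.setLength(0);
            } else if (header != null) {
                residues.append(line);
            }
        }
        if (header != null) {
            result.put(header, guess(residues.toString()));
        }
        return result;
    }

    public static void main(String[] args) {
        String fasta = ">dna1\nacgtacgt\nttga\n>prot1\nmngtegpnfy\n";
        System.out.println(classify(fasta));
    }
}
```

Each record could then be handed to the reader for its guessed alphabet. Bear in mind the heuristic has its own blind spot: a short peptide made up only of A, C, G, T and N would be classified as DNA, so if you know the alphabet up front, specifying it explicitly remains the safer route.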
cheers,
Richard
On 6 Nov 2009, at 16:54, Carl Mäsak wrote:
> Richard (>), Carl (>>):
>>> I'm using RichSequenceIterator to read FASTA files containing
>>> proteins. Somehow it doesn't work when the protein sequences are in
>>> lowercase, which they sometimes are when downloaded from e.g.
>>> Uniprot.
>>> My code fails to recognize the following file as containing a
>>> protein sequence:
>>>
>>>> OPSD_FELCA
>>>
>>>
>>> mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyilln
>>>
>>> lavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgv
>>>
>>> aftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaq
>>>
>>> qqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrn
>>> cmlttlccgknplgddeasttgsktetsqvapa
>>>
>>> What am I missing? Here's the code I'm using to read in sequences:
>>>
>>> private List<ISequence> sequencesFromInputStream(InputStream stream) {
>>>
>>>     BufferedInputStream bufferedStream = new BufferedInputStream(stream);
>>>     Namespace ns = RichObjectFactory.getDefaultNamespace();
>>>     RichSequenceIterator seqit = null;
>>>
>>>     try {
>>>         seqit = RichSequence.IOTools.readStream(bufferedStream, ns);
>>>     } catch (IOException e) {
>>>         logger.error("Couldn't read sequences from file", e);
>>>         return Collections.emptyList();
>>>     }
>>>
>>>     List<ISequence> sequences = new ArrayList<ISequence>();
>>>     try {
>>>         while ( seqit.hasNext() ) {
>>>             RichSequence rseq;
>>>             rseq = seqit.nextRichSequence(); // *error occurs here*
>>>             if (rseq == null)
>>>                 continue;
>>>             String alphabet = rseq.getAlphabet().getName();
>>>             sequences.add(
>>>                 "DNA".equals(alphabet) ? new BiojavaDNA(rseq)
>>>               : "RNA".equals(alphabet) ? new BiojavaRNA(rseq)
>>>               :                          new BiojavaProtein(rseq) );
>>>         }
>>>     } catch (NoSuchElementException e) {
>>>         logger.error("Read past last sequence", e);
>>>     } catch (BioException e) {
>>>         logger.error(e); // *ends up here*
>>>     }
>>>
>>>     return sequences;
>>> }
>>>
>>> Grateful for any pointers you might have.
>>
>> Could you post the output from the exception stack that it generates?
>
> org.biojava.bio.BioException: Could not read sequence
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence
> (RichStreamReader.java:113)
> at
> net.bioclipse.biojava.business.BiojavaManager.sequencesFromInputStream
> (BiojavaManager.java:314)
> at net.bioclipse.biojava.business.BiojavaManager.sequencesFromFile
> (BiojavaManager.java:291)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> net.bioclipse.managers.business.AbstractManagerMethodDispatcher.doInvoke
> (AbstractManagerMethodDispatcher.java:243)
> at
> net.bioclipse.managers.business.JavaManagerMethodDispatcher.doInvokeInSameThread
> (JavaManagerMethodDispatcher.java:248)
> at
> net.bioclipse.managers.business.AbstractManagerMethodDispatcher.invoke
> (AbstractManagerMethodDispatcher.java:130)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at net.bioclipse.recording.WrapInProxyAdvice.invoke
> (WrapInProxyAdvice.java:22)
> at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.springframework.osgi.service.importer.internal.aop.ServiceInvoker.doInvoke
> (ServiceInvoker.java:59)
> at
> org.springframework.osgi.service.importer.internal.aop.ServiceInvoker.invoke
> (ServiceInvoker.java:67)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at
> org.springframework.osgi.service.importer.internal.aop.ServiceTCCLInterceptor.invoke
> (ServiceTCCLInterceptor.java:34)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at
> org.springframework.osgi.service.importer.support.LocalBundleContextAdvice.invoke
> (LocalBundleContextAdvice.java:59)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at
> org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed
> (DelegatingIntroductionInterceptor.java:131)
> at
> org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke
> (DelegatingIntroductionInterceptor.java:119)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at org.springframework.aop.framework.JdkDynamicAopProxy.invoke
> (JdkDynamicAopProxy.java:204)
> at $Proxy18.invoke(Unknown Source)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at
> org.springframework.aop.framework.adapter.AfterReturningAdviceInterceptor.invoke
> (AfterReturningAdviceInterceptor.java:50)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed
> (ReflectiveMethodInvocation.java:171)
> at org.springframework.aop.framework.JdkDynamicAopProxy.invoke
> (JdkDynamicAopProxy.java:204)
> at $Proxy20.sequencesFromFile(Unknown Source)
> at net.bioclipse.biojava.ui.editors.Aligner.setInput(Aligner.java:
> 152)
> at net.bioclipse.biojava.ui.editors.Aligner.init(Aligner.java:138)
> at org.eclipse.ui.part.MultiPageEditorPart.addPage
> (MultiPageEditorPart.java:238)
> at org.eclipse.ui.part.MultiPageEditorPart.addPage
> (MultiPageEditorPart.java:212)
> at net.bioclipse.biojava.ui.editors.SequenceEditor.createPages
> (SequenceEditor.java:47)
> at org.eclipse.ui.part.MultiPageEditorPart.createPartControl
> (MultiPageEditorPart.java:357)
> at org.eclipse.ui.internal.EditorReference.createPartHelper
> (EditorReference.java:662)
> at org.eclipse.ui.internal.EditorReference.createPart
> (EditorReference.java:462)
> at org.eclipse.ui.internal.WorkbenchPartReference.getPart
> (WorkbenchPartReference.java:595)
> at org.eclipse.ui.internal.PartPane.setVisible(PartPane.java:313)
> at org.eclipse.ui.internal.presentations.PresentablePart.setVisible
> (PresentablePart.java:180)
> at
> org.eclipse.ui.internal.presentations.util.PresentablePartFolder.select
> (PresentablePartFolder.java:270)
> at
> org.eclipse.ui.internal.presentations.util.LeftToRightTabOrder.select
> (LeftToRightTabOrder.java:65)
> at
> org.eclipse.ui.internal.presentations.util.TabbedStackPresentation.selectPart
> (TabbedStackPresentation.java:473)
> at org.eclipse.ui.internal.PartStack.refreshPresentationSelection
> (PartStack.java:1256)
> at org.eclipse.ui.internal.PartStack.setSelection(PartStack.java:
> 1209)
> at org.eclipse.ui.internal.PartStack.showPart(PartStack.java:1608)
> at org.eclipse.ui.internal.PartStack.add(PartStack.java:499)
> at org.eclipse.ui.internal.EditorStack.add(EditorStack.java:103)
> at org.eclipse.ui.internal.PartStack.add(PartStack.java:485)
> at org.eclipse.ui.internal.EditorStack.add(EditorStack.java:112)
> at org.eclipse.ui.internal.EditorSashContainer.addEditor
> (EditorSashContainer.java:63)
> at org.eclipse.ui.internal.EditorAreaHelper.addToLayout
> (EditorAreaHelper.java:225)
> at org.eclipse.ui.internal.EditorAreaHelper.addEditor
> (EditorAreaHelper.java:213)
> at org.eclipse.ui.internal.EditorManager.createEditorTab
> (EditorManager.java:778)
> at org.eclipse.ui.internal.EditorManager.openEditorFromDescriptor
> (EditorManager.java:677)
> at org.eclipse.ui.internal.EditorManager.openEditor
> (EditorManager.java:638)
> at org.eclipse.ui.internal.WorkbenchPage.busyOpenEditorBatched
> (WorkbenchPage.java:2854)
> at org.eclipse.ui.internal.WorkbenchPage.busyOpenEditor
> (WorkbenchPage.java:2762)
> at org.eclipse.ui.internal.WorkbenchPage.access$11
> (WorkbenchPage.java:2754)
> at org.eclipse.ui.internal.WorkbenchPage$10.run(WorkbenchPage.java:
> 2705)
> at org.eclipse.swt.custom.BusyIndicator.showWhile
> (BusyIndicator.java:70)
> at org.eclipse.ui.internal.WorkbenchPage.openEditor
> (WorkbenchPage.java:2701)
> at org.eclipse.ui.internal.WorkbenchPage.openEditor
> (WorkbenchPage.java:2685)
> at org.eclipse.ui.internal.WorkbenchPage.openEditor
> (WorkbenchPage.java:2676)
> at org.eclipse.ui.ide.IDE.openEditor(IDE.java:651)
> at org.eclipse.ui.ide.IDE.openEditor(IDE.java:610)
> at org.eclipse.ui.actions.OpenFileAction.openFile
> (OpenFileAction.java:99)
> at org.eclipse.ui.actions.OpenSystemEditorAction.run
> (OpenSystemEditorAction.java:99)
> at org.eclipse.ui.actions.RetargetAction.run(RetargetAction.java:221)
> at org.eclipse.ui.navigator.CommonNavigatorManager$3.open
> (CommonNavigatorManager.java:202)
> at org.eclipse.ui.OpenAndLinkWithEditorHelper$InternalListener.open
> (OpenAndLinkWithEditorHelper.java:48)
> at org.eclipse.jface.viewers.StructuredViewer$2.run
> (StructuredViewer.java:842)
> at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
> at org.eclipse.core.runtime.Platform.run(Platform.java:888)
> at org.eclipse.ui.internal.JFaceUtil$1.run(JFaceUtil.java:48)
> at org.eclipse.jface.util.SafeRunnable.run(SafeRunnable.java:175)
> at org.eclipse.jface.viewers.StructuredViewer.fireOpen
> (StructuredViewer.java:840)
> at org.eclipse.jface.viewers.StructuredViewer.handleOpen
> (StructuredViewer.java:1101)
> at org.eclipse.ui.navigator.CommonViewer.handleOpen
> (CommonViewer.java:467)
> at org.eclipse.jface.viewers.StructuredViewer$6.handleOpen
> (StructuredViewer.java:1205)
> at org.eclipse.jface.util.OpenStrategy.fireOpenEvent
> (OpenStrategy.java:264)
> at org.eclipse.jface.util.OpenStrategy.access$2(OpenStrategy.java:
> 258)
> at org.eclipse.jface.util.OpenStrategy$1.handleEvent
> (OpenStrategy.java:298)
> at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
> at org.eclipse.swt.widgets.Display.sendEvent(Display.java:3543)
> at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1250)
> at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1273)
> at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1258)
> at org.eclipse.swt.widgets.Widget.notifyListeners(Widget.java:1079)
> at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:
> 3441)
> at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3100)
> at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:
> 2405)
> at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2369)
> at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2221)
> at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:500)
> at org.eclipse.core.databinding.observable.Realm.runWithDefault
> (Realm.java:332)
> at org.eclipse.ui.internal.Workbench.createAndRunWorkbench
> (Workbench.java:493)
> at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:
> 149)
> at net.bioclipse.ui.Application.start(Application.java:36)
> at org.eclipse.equinox.internal.app.EclipseAppHandle.run
> (EclipseAppHandle.java:194)
> at
> org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication
> (EclipseAppLauncher.java:110)
> at
> org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start
> (EclipseAppLauncher.java:79)
> at org.eclipse.core.runtime.adaptor.EclipseStarter.run
> (EclipseStarter.java:368)
> at org.eclipse.core.runtime.adaptor.EclipseStarter.run
> (EclipseStarter.java:179)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:559)
> at org.eclipse.equinox.launcher.Main.basicRun(Main.java:514)
> at org.eclipse.equinox.launcher.Main.run(Main.java:1311)
> at org.eclipse.equinox.launcher.Main.main(Main.java:1287)
> Caused by: org.biojava.bio.seq.io.ParseException:
>
> A Exception Has Occurred During Parsing.
> Please submit the details that follow to biojava-l at biojava.org or post
> a bug report to http://bugzilla.open-bio.org/
>
> Format_object=org.biojavax.bio.seq.io.FastaFormat
> Accession=OPSD_FELCA
> Id=null
> Comments=problem parsing symbols
> Parse_block
> =
> mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyillnlavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgvaftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaqqqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrncmlttlccgknplgddeasttgsktetsqvapa
> Stack trace follows ....
>
>
> at org.biojavax.bio.seq.io.FastaFormat.readRichSequence
> (FastaFormat.java:244)
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence
> (RichStreamReader.java:110)
> ... 114 more
> Caused by: org.biojava.bio.symbol.IllegalSymbolException: This
> tokenization doesn't contain character: 'e'
> at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar
> (CharacterTokenization.java:175)
> at org.biojava.bio.seq.io.CharacterTokenization
> $TPStreamParser.characters(CharacterTokenization.java:246)
> at org.biojava.bio.symbol.SimpleSymbolList.<init>
> (SimpleSymbolList.java:178)
> at org.biojavax.bio.seq.io.FastaFormat.readRichSequence
> (FastaFormat.java:237)
> ... 115 more
>
> // Carl
--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/