[Biojava-l] Fasta parsing question

Andreas Prlic andreas at sdsc.edu
Wed Jun 17 22:52:19 UTC 2015


Hi Toorn,

I can confirm this resulted in an endless loop. I committed a patch for
this plus some junit tests for validation. Please see:

https://github.com/biojava/biojava/issues/282


I also added documentation to the tutorial:

https://github.com/biojava/biojava-tutorial/blob/master/core/readwrite.md


For verification, I just parsed the 10 GB (gzip-compressed) TrEMBL FASTA file
using no more than 100 MB of memory.
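
As a rough illustration of why a compressed file of that size need not fit in
memory, here is a minimal sketch in plain Java that streams a gzip-compressed
FASTA input line by line and only counts the records. It is not how the
verification above was run and it does not use BioJava; the class name
GzipFastaCount and both of its methods are made up for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of BioJava.
public class GzipFastaCount {

    /** Counts FASTA records (lines starting with '>') in a gzip-compressed stream. */
    public static long countRecords(InputStream gzipped) throws IOException {
        // Decompression happens on the fly; only one line is held at a time.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    count++;
                }
            }
            return count;
        }
    }

    /** Demo helper: gzip a small FASTA string in memory. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip(">test test\nPEPTIDEK\n>second\nPEPTIDER\n");
        System.out.println(countRecords(new ByteArrayInputStream(data)));
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a
FileInputStream on the .gz file; memory use then depends on the longest line,
not on the file size.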

If you update your code, this should start working for you now.
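
The loop in the quoted code below only terminates if process(max) returns
null once the input is exhausted. To make that contract concrete, here is a
small stand-alone sketch of a chunked reader that behaves this way. It is not
BioJava's implementation: ChunkedFastaSketch is a hypothetical class, it maps
header to raw sequence string, and a repeated header would simply overwrite
the earlier entry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;

// Hypothetical stand-in for a chunked FASTA reader; not BioJava code.
public class ChunkedFastaSketch {
    private final BufferedReader reader;
    private String pendingHeader; // header already read but not yet returned

    public ChunkedFastaSketch(Reader in) {
        this.reader = new BufferedReader(in);
    }

    /** Returns up to max records (header -> sequence), or null at end of input. */
    public LinkedHashMap<String, String> process(int max) throws IOException {
        LinkedHashMap<String, String> chunk = new LinkedHashMap<>();
        while (chunk.size() < max) {
            String header = pendingHeader != null ? pendingHeader : nextHeader();
            pendingHeader = null;
            if (header == null) {
                break; // nothing left to read
            }
            StringBuilder seq = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1); // belongs to the next record
                    break;
                }
                seq.append(line.trim());
            }
            chunk.put(header, seq.toString());
        }
        // An empty chunk means the input is exhausted: signal that with null.
        return chunk.isEmpty() ? null : chunk;
    }

    /** Skips ahead to the next header line; returns null at end of input. */
    private String nextHeader() throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                return line.substring(1);
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        ChunkedFastaSketch fasta =
                new ChunkedFastaSketch(new StringReader(">test test\nPEPTIDEK\n\n"));
        LinkedHashMap<String, String> b;
        while ((b = fasta.process(10)) != null) {
            for (String header : b.keySet()) {
                System.out.println(header);
            }
        }
    }
}
```

Returning null rather than an empty map is a design choice that matches the
while ((b = ...) != null) idiom; a reader returning an empty map would need
the caller to test isEmpty() instead.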

Andreas


On Wed, Jun 17, 2015 at 2:26 AM, Toorn, H.W.P. van den (Henk) <
h.w.p.vandentoorn at uu.nl> wrote:

>  Hi Andreas, thanks very much. I've compiled some (working) code to
> illustrate how I think this should work. The artificial sample fasta file
> contains only one sequence:
>
>
>
> ---------------
> >test test
> PEPTIDEK
>
> ---------------
> If you use a larger FASTA file, the file is first parsed correctly, but
> when it finishes, the loop just continues. I'm aware I'm probably doing
> something wrong in my code, but to me it's just not clear how to do it
> correctly, and that's basically my question.
>
> The code below loops forever; the output keeps repeating this:
>
> --------------
> 11:18:56 [main] WARN  org.biojava.nbio.core.sequence.io.FastaReader - Can't parse sequence 12. Got sequence of length 0!
> 11:18:56 [main] WARN  org.biojava.nbio.core.sequence.io.FastaReader - header: test test
> test test
> ---------------
>
> package nl.hecklab.bioinformatics.fastafilereaderexample;
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.util.LinkedHashMap;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.biojava.nbio.core.sequence.ProteinSequence;
> import org.biojava.nbio.core.sequence.compound.AminoAcidCompound;
> import org.biojava.nbio.core.sequence.compound.AminoAcidCompoundSet;
> import org.biojava.nbio.core.sequence.io.FastaReader;
> import org.biojava.nbio.core.sequence.io.GenericFastaHeaderParser;
> import org.biojava.nbio.core.sequence.io.ProteinSequenceCreator;
>
> /**
>  *
>  * @author toorn101
>  */
> public class App {
>
>     public App() {
>         try {
>             InputStream inStream = this.getClass().getResourceAsStream("/test.fasta");
>             FastaReader<ProteinSequence, AminoAcidCompound> fastaReader = new FastaReader<>(
>                     inStream,
>                     new GenericFastaHeaderParser<ProteinSequence, AminoAcidCompound>(),
>                     new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));
>             LinkedHashMap<String, ProteinSequence> b;
>             while ((b = fastaReader.process(10)) != null) {
>                 for (String seq : b.keySet()) {
>                     System.out.println(seq);
>                 }
>             }
>         } catch (IOException ex) {
>             Logger.getLogger(App.class.getName()).log(Level.SEVERE, null, ex);
>         }
>     }
>
>     public static void main(String[] args) {
>         new App();
>
>     }
>
> }
>
>
> On 6/17/2015 7:04 AM, Andreas Prlic wrote:
>
> Hi Henk,
>
>  Do you want to share some code snippets so we can help you debug?
>
>  Thanks,
>
>  Andreas
>
>
>
> On Mon, Jun 15, 2015 at 1:58 AM, Toorn, H.W.P. van den (Henk) <
> h.w.p.vandentoorn at uu.nl> wrote:
>
>> Dear List,
>>
>> I've just started using BioJava 4.0.0 in my projects, and wanted to ask a
>> question about parsing large Fasta files. There is the option to read parts
>> of the fasta file.
>>
>> FastaReader.process(number)
>>
>> The problem I have is that it's not documented what happens once the file
>> has been read in its entirety. I was expecting a null or an empty map, or
>> even some exception, but none of these happened and the parser kept on
>> producing (empty) sequences.
>>
>> Could anyone enlighten me? I'm probably missing the point here. Maybe
>> there is a better way to do this (there used to be the SequenceIterator if
>> I remember correctly, but I can't find that in version 4.0).
>>
>>
>>
>> Regards, Henk
>>
>> My setup: windows 7 64-bit, java 1.8.0_45 64 bit, BioJava 4.0.0 via Maven.