[Biojava-l] sort fasta file

xyz mitlox at op.pl
Fri Mar 26 09:57:41 UTC 2010


@Andy: Thank you for the explanation. There is no newline character
after the last sequence in the input file.

@James: I changed the code to put the longest sequence first, but the
last sequence is missing from the output.


import java.io.*;
import java.util.*;

import org.biojava.bio.BioException;
import org.biojava.bio.symbol.*;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.*;

import java.util.Comparator;

public class SortFasta2 {

  static private class RichSequenceComparator implements
  Comparator<RichSequence> {

    public int compare(RichSequence seq1, RichSequence seq2) {
      return  seq2.length() - seq1.length();
    }
  }

  // Usage:  SortFasta2  (reads the hard-coded file sortFasta.fasta)
  public static void main(String[] args) throws FileNotFoundException,
  BioException {

    String fastaFile = "sortFasta.fasta";

    BufferedReader br = new BufferedReader(new FileReader(fastaFile));
    SimpleNamespace ns = new SimpleNamespace("biojava");

    // The input is nucleotide data, so look up the DNA alphabet.
    Alphabet dna = AlphabetManager.alphabetForName("DNA");

    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br,
            dna.getTokenization("token"),
            ns);
    

    SortedSet<RichSequence> sorted = new TreeSet<RichSequence>(new
    SortFasta2.RichSequenceComparator());

    while (rsi.hasNext()) {
      sorted.add(rsi.nextRichSequence());
    }

    Iterator<RichSequence> sortedIt = sorted.iterator();

    /* Do whatever you want here with the RichSequences, which are now
    ordered by descending length (longest first). I'll just print them. */
    while (sortedIt.hasNext()) {
      //System.out.println(((RichSequence) sortedIt.next()).length());
      //System.out.println(sortedIt.next().getComments());
      System.out.println(sortedIt.next().seqString());
    }
  }
}

Input file:
>1
atccccc
>2
atccccctttttt
>3
atccccccccccccccccctttt
>4
tttttttccccccccccccccccccccccc
>5
tttttttcccccccccccccccccccccca

Output on the screen:
tttttttccccccccccccccccccccccc
atccccccccccccccccctttt
atccccctttttt
atccccc

How can I get the last sequence as well, and print the output in
FASTA format on the screen?

Thank you in advance.
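
A likely explanation, offered as a sketch rather than a confirmed
diagnosis: sequences 4 and 5 in the input appear to have the same
length, and a TreeSet treats two elements as duplicates whenever its
comparator returns 0, so the second one is silently dropped. The
trailing newline is probably not involved. The example below uses plain
strings instead of RichSequence objects so that it runs without BioJava;
the class name, method names and sample strings are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class EqualLengthDemo {

  // Longest first. Returns 0 for ANY two strings of equal length.
  static final Comparator<String> BY_LENGTH_DESC = new Comparator<String>() {
    public int compare(String a, String b) {
      return Integer.compare(b.length(), a.length());
    }
  };

  // A TreeSet treats compare(a, b) == 0 as "already present",
  // so equal-length entries after the first are not added.
  static List<String> sortWithTreeSet(List<String> seqs) {
    TreeSet<String> set = new TreeSet<String>(BY_LENGTH_DESC);
    set.addAll(seqs);
    return new ArrayList<String>(set);
  }

  // A List keeps every element. Collections.sort is stable,
  // so equal-length entries stay in their input order.
  static List<String> sortWithList(List<String> seqs) {
    List<String> copy = new ArrayList<String>(seqs);
    Collections.sort(copy, BY_LENGTH_DESC);
    return copy;
  }

  public static void main(String[] args) {
    // Hypothetical data: the last two strings have the same length.
    List<String> seqs = Arrays.asList("at", "atcc", "ttcc");
    System.out.println(sortWithTreeSet(seqs)); // [atcc, at]
    System.out.println(sortWithList(seqs));    // [atcc, ttcc, at]
  }
}
```

In the posted program the same idea would mean collecting the sequences
into an ArrayList<RichSequence> and sorting it with the existing
comparator, or keeping the TreeSet and breaking ties on something
unique such as the sequence name.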




On Thu, 25 Mar 2010 10:17:31 -0400
James Swetnam wrote:

> Just replace the System.out.println with whatever you want to do with
> the sequences; write them to a file, etc.
> 
> James
> 
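
As one way of replacing the println, here is a minimal sketch that
formats a record as FASTA text: a '>' header line followed by the
residues, wrapped at a fixed width. It uses plain strings so that it
runs on its own; FastaPrinter and toFasta are hypothetical names. In the
program above, the name and residues would presumably come from
getName() and seqString(). BioJava's RichSequence.IOTools also appears
to offer writeFasta methods, so it is worth checking the API docs before
hand-rolling this.

```java
class FastaPrinter {

  // Formats one record: a '>' header line, then the residues
  // split into lines of at most lineWidth characters.
  static String toFasta(String name, String residues, int lineWidth) {
    StringBuilder sb = new StringBuilder();
    sb.append('>').append(name).append('\n');
    for (int i = 0; i < residues.length(); i += lineWidth) {
      int end = Math.min(i + lineWidth, residues.length());
      sb.append(residues, i, end).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Two records taken from the example input, longest first.
    System.out.print(toFasta("2", "atccccctttttt", 60));
    System.out.print(toFasta("1", "atccccc", 60));
  }
}
```

The same string can be passed to a FileWriter or PrintStream instead of
System.out to write the sorted records to a file.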

On Fri, 26 Mar 2010 09:40:28 +0000
"Andy Law (RI)" wrote:

> Does your input file have a line feed at the end or not? (Just a  
> thought)
> 
> Comparable is for comparing two objects using their "natural"
> ordering and is therefore a "fundamental" property of the class. A
> Comparator lets you compare/sort two objects on any characteristics
> and you can have many different comparators. Since this is a somewhat
> arbitrary way of comparing sequences (you could sort them on
> alphabetical sequence for example, or GC content), I guess that's why
> James used a comparator.
> 
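
To illustrate the point about having many different comparators, the
sketch below sorts the same list three ways: by length, by GC content
and alphabetically. Plain strings stand in for sequence objects, and the
class, field and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SequenceOrderings {

  // Fraction of residues that are g or c (0.0 for an empty string).
  static double gcContent(String s) {
    if (s.isEmpty()) {
      return 0.0;
    }
    int gc = 0;
    for (char c : s.toLowerCase().toCharArray()) {
      if (c == 'g' || c == 'c') {
        gc++;
      }
    }
    return (double) gc / s.length();
  }

  // Three independent orderings over the same kind of object.
  static final Comparator<String> BY_LENGTH_DESC =
      (a, b) -> Integer.compare(b.length(), a.length());
  static final Comparator<String> BY_GC_CONTENT =
      (a, b) -> Double.compare(gcContent(a), gcContent(b));
  static final Comparator<String> ALPHABETICAL =
      (a, b) -> a.compareTo(b);

  // Returns a sorted copy, leaving the input list untouched.
  static List<String> sorted(List<String> seqs, Comparator<String> order) {
    List<String> copy = new ArrayList<String>(seqs);
    Collections.sort(copy, order);
    return copy;
  }

  public static void main(String[] args) {
    List<String> seqs = Arrays.asList("ggcc", "tta", "acgta");
    System.out.println(sorted(seqs, BY_LENGTH_DESC)); // [acgta, ggcc, tta]
    System.out.println(sorted(seqs, BY_GC_CONTENT));  // [tta, acgta, ggcc]
    System.out.println(sorted(seqs, ALPHABETICAL));   // [acgta, ggcc, tta]
  }
}
```

None of these orderings is more "natural" than the others, which is the
argument for passing a Comparator rather than making the sequence class
itself Comparable.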



