[Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?

Carl Mäsak cmasak at gmail.com
Fri Nov 6 16:25:57 UTC 2009


I'm using a RichSequenceIterator to read FASTA files containing
proteins. Reading fails when the protein sequences are in lowercase,
which they sometimes are when downloaded from e.g. UniProt. Instead of
recognizing the following file as containing a protein sequence, my
code throws a BioException:

>OPSD_FELCA
mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyilln
lavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgv
aftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaq
qqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrn
cmlttlccgknplgddeasttgsktetsqvapa

What am I missing? Here's the code I'm using to read in sequences:

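    // BioJava imports used below; java.io/java.util imports and the
    // project's own classes (ISequence, BiojavaDNA, ...) are omitted.
    import org.biojava.bio.BioException;
    import org.biojavax.Namespace;
    import org.biojavax.RichObjectFactory;
    import org.biojavax.bio.seq.RichSequence;
    import org.biojavax.bio.seq.RichSequenceIterator;

    private List<ISequence> sequencesFromInputStream(InputStream stream) {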

        BufferedInputStream bufferedStream = new BufferedInputStream(stream);
        Namespace ns = RichObjectFactory.getDefaultNamespace();
        RichSequenceIterator seqit = null;

        try {
            seqit = RichSequence.IOTools.readStream(bufferedStream, ns);
        } catch (IOException e) {
            logger.error("Couldn't read sequences from file", e);
            return Collections.emptyList();
        }

        List<ISequence> sequences = new ArrayList<ISequence>();
        try {
            while ( seqit.hasNext() ) {
                RichSequence rseq = seqit.nextRichSequence(); // *error occurs here*
                if (rseq == null)
                    continue;
                String alphabet = rseq.getAlphabet().getName();
                sequences.add(
                      "DNA".equals(alphabet) ? new BiojavaDNA(rseq)
                    : "RNA".equals(alphabet) ? new BiojavaRNA(rseq)
                    :                          new BiojavaProtein(rseq) );
            }
        } catch (NoSuchElementException e) {
            logger.error("Read past last sequence", e);
        } catch (BioException e) {
            logger.error("Couldn't parse sequence", e); // *ends up here*
        }

        return sequences;
    }
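
One workaround that could be tried is to uppercase the residue lines
before the stream reaches the parser. The sketch below assumes the
failure really is caused by the lowercase letters, which has not been
confirmed. FastaUppercaser and uppercaseSequenceLines are hypothetical
names, not part of BioJava, and the helper buffers the whole input in
memory, so it only suits files of moderate size.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper (not part of BioJava): uppercases FASTA residue lines. */
final class FastaUppercaser {

    private FastaUppercaser() {}

    /**
     * Reads the whole FASTA input into memory and returns a new stream in
     * which every non-header line is uppercased. Header lines (those that
     * start with '>') are copied unchanged. Each output line ends in '\n'.
     */
    static InputStream uppercaseSequenceLines(InputStream in) throws IOException {
        StringBuilder out = new StringBuilder();
        // ISO-8859-1 maps bytes to chars one-to-one, so header bytes survive as-is.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = reader.readLine()) != null) {
            boolean isHeader = line.startsWith(">");
            out.append(isHeader ? line : line.toUpperCase(Locale.ROOT)).append('\n');
        }
        return new ByteArrayInputStream(
                out.toString().getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

The method above could then be called as
sequencesFromInputStream(FastaUppercaser.uppercaseSequenceLines(stream)),
with the extra IOException handled by the caller. If the uppercased
file parses and the lowercase one does not, that would at least narrow
the problem down to case handling.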

Grateful for any pointers you might have.

Regards,
// Carl Mäsak



