[Biopython-dev] [Biopython (old issues only) - Bug #2382] (Closed) Generic Roche or GSFlex "FASTA" parser

redmine at redmine.open-bio.org redmine at redmine.open-bio.org
Mon Nov 14 17:10:13 UTC 2016


Issue #2382 has been updated by Peter Cock.

Description updated
Status changed from New to Closed
% Done changed from 0 to 100

I'm going to close this issue. We have FASTA and QUAL parsers in Biopython, but I can't think of any other FASTA-like examples which have been requested over the years which would justify a general framework for a wider loosely defined family of FASTA-like formats. However, file formats always seem to be changing in bioinformatics so who knows what things will be like in a few more years? Thanks everyone for your comments and suggestions here.

----------------------------------------
Bug #2382: Generic Roche or GSFlex  "FASTA" parser
https://redmine.open-bio.org/issues/2382#change-15378

* Author: Jared Flatow
* Status: Closed
* Priority: Normal
* Assignee: Biopython Dev Mailing List
* Category: Main Distribution
* Target version: Not Applicable
* URL: 
----------------------------------------
I would like to be able read in and iterate over records in generic fasta files of the format:
>header
data
>header
data
...

This iterator should return Bio.Fasta.Record objects with the corresponding header and data fields.

I suggest putting this inside the existing Bio.Fasta module and updating Bio.SeqIO.Fasta to use this iterator and transform the records returned into Bio.SeqRecord objects.

This should make it easier to add metadata to SeqRecord objects parsed in from FASTA. Consider the following example for illustration. I have data from a genome sequencing machine that outputs pairs of files. One contains the sequence reads which look like this, the other contains estimates of the quality of each base call in the sequence.

The sequence file might look something like this (only with hundreds of thousands more entries):

>ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run=R_runname
CAATATAATTTCTCTTAAAATTATTCCCATGGCCAGGTGTGGTGGCTCACACCTGTAGTC
CCGGCACTTTGGGAGGCCAAGGCACACAGGGGATAGG
>ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname
GGTCTCCAGTGCCCTGTCTCCCCATATTTCTGACACACCTTCTCACAGCCTGGCCCATCT
TGCTGGGTCCCTCTTCTCCTCCCTTCCTGCTCCATTTGTCAACACTGCTGGGACATTAGA
ATTCAGATCTCCCGGGTCACCG
>ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname
AAAGTGACTAAAGAATCAATTTACATTAATATTCTATGTGAACAGGCAAAATACTTACAA
AGAAGTAGAGAAAATATGAATTCAGTACAGAATTCAGATCTCCCGGGTCACCG

The corresponding quality score file might look something like this:

>ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run= R_runname
27 28 21 27 27 27 28 22 28 25 3 27 27 27 28 21 33 31 20 6 28 21 26 26 18 28 25 2 26 25 29 23 31 24 27 29 22 27 27 27 29 23 27 31 25 27 27 27 27 27 27 32 26 27 27 27 27 26 27 33
30 12 32 26 27 27 27 33 30 12 33 30 12 26 31 25 33 27 32 28 33 28 27 27 27 27 27 26 33 32 20 7 27 27 27 32 26
>ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname
28 9 26 24 27 27 20 26 18 25 27 32 29 10 26 26 27 18 25 32 30 17 1 25 27 22 32 30 12 27 27 22 26 25 27 23 25 28 21 32 27 27 27 25 26 27 26 25 27 20 26 26 19 28 25 3 25 27 22 27
19 24 24 24 32 29 11 24 34 31 17 23 23 30 23 27 25 30 23 27 33 31 17 27 20 28 21 27 25 26 26 30 24 27 33 31 13 26 27 27 31 25 27 25 23 26 16 26 27 30 27 7 27 27 27 32 27 26 26 32
27 30 26 27 27 27 27 27 27 27 30 27 6 34 31 17 27 21 27 32 28 18
>ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname
29 26 5 25 27 24 27 27 27 30 27 7 26 27 19 25 26 31 26 34 32 16 20 27 26 32 27 32 28 27 25 26 18 27 25 27 26 26 24 27 31 25 27 27 31 26 26 34 32 23 11 26 22 27 32 26 27 26 32 30
11 26 31 24 27 27 25 23 27 27 33 30 19 4 17 26 25 26 31 27 30 26 27 26 22 26 18 24 27 26 32 26 32 28 27 27 25 27 25 24 25 31 28 10 34 31 15 27 21 27 28 21 27

I would like to be able to do the following:

# create a function to parse the header line and return a dictionary
def parse_gsflex_header(gs_header):
        parts = gs_record.description.split(' ')
        assert len(parts) == 5
        xy = parts[2].split('=')[1].split('_')
        return {'letters': gs_record.seq.tostring(),
                'name': parts[0],
                'length': parts[1].split('=')[1],
                'xpos': xy[0],
                'ypos': xy[1],
                'region': parts[3].split('=')[1],
                'run': parts[4].split('=')[1]}

# Bio.SeqIO.FastaIO wraps the Bio.Fasta parser, might look something like this
class Fasta(): # or however its organized
   def data_toseq(data):
      # do some parsing of the data
      return Seq(...)

   def parse(file, header_todict):
     return (SeqRecord(seq=data_toseq(record.data), **header_todict(record.header)) for record in Bio.Fasta.parse(file))

# I create an initial SeqRecord dictionary using the Bio.SeqIO.FastaIO parser
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file,  parse_gsflex_header))

# Then I iterate over the sequences in the qual file and look them up in the seq_dict
# setting the quality scores as I find them them
for record in Bio.Fasta.parse(qual_file):
	seq_dict[my_header_todict(record.header)['id']].quality =  my_qualitycleanup(record.data)

This would work well for parsing all kinds of FASTA-like files and provides a simple mechanism for dealing with them record by record.

---Files--------------------------------
sift.py (7.9 KB)


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython-dev/attachments/20161114/1a94eb3d/attachment.html>


More information about the Biopython-dev mailing list