[Biopython-dev] Bio.GFF and Brad's code

Mon Apr 13 12:16:10 UTC 2009

On Sat, Apr 11, 2009 at 12:29 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Hi Brad,
>
> Thanks for the examples; that clarified it a lot.

I haven't tried the code yet, but I have a GFF file I need to convert
into FASTA format.  Hopefully later this week I'll get to that...

There are a few things I can ask now through:

Why are the functions _gff_line_map() and _gff_line_reduce() private
(leading underscores)?  I had thought you wanted to make the
map/reduce approach available to people trying to parse GFF files on
multiple threads (e.g. using disco) which would require them to use
these two functions, wouldn't it?  If so, they should be part of the
public API.

I don't see any support for the optional FASTA block in a GFF file.
Is this something you intend to add later Brad?  See also my thoughts
below for Bio.SeqIO integration.

> I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython.
> Looking at your first example:
>
>> from BCBio.GFF.GFFParser import GFFAddingIterator
>>
>> gff_iterator = GFFAddingIterator()
>> rec_dict = gff_iterator.get_all_features(gff_file)
>>
>> The returned dictionary is like a dictionary from
>> SeqIO.to_dict;
>> keys are ids and values are SeqRecords.
>
> It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this:
>
> from Bio import GFF
> handle = open("my_gff_file.gff")
> for line in handle:
>    # call the appropriate GFF function on the line

I think the appropriate GFF function here might be Brad's
_gff_line_map().  This knows about different GFF line types (e.g. ##
header lines).  I'm not sure if a line based approach like this can
cope with the optional ##FASTA block through.

> The second point is about GFFAddingIterator.get_all_features. If this
> is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
> Then the code looks as follows:
>
> from Bio import GFF
> handle = open("my_gff_file.gff")
> rec_dict = GFF.to_dict(handle)

Well, the Bio.SeqIO.to_dict() function takes a SeqRecord list/iterator
rather than a handle, but that might make sense here.

> Another thing to consider is that IDs in the GFF file do not need to be unique.
> For example, consider a GFF file that stores genome mapping locations for
> short sequences stored in a Fasta file. Since each sequence can have more
> than one mapping location, we can have multiple lines in the GFF file for one
> sequence ID.

That sounds nasty.  Do you have any example files of this we could use
for a test case?

> The last point is about storing SeqRecords in rec_dict. A GFF file typically
> does not store sequences; if it does, it's not clear which field in the GFF file
> does. On the other hand, a SeqRecord often does not contain the
> chromosomal location, which is what the GFF file stores. So why use a
> SeqRecord for GFF information?

I don't think the GFF parser should only return SeqRecord object, but
I do see a use for this (via Bio.SeqIO).  GFF files could be
represented as a list of SeqFeature objects, and using a SeqRecord to
hold this seems very natural to me.  It also means we could use
Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a
BioSQL database.

If you look at the NCBI FTP site, they often provide genome sequences
in a range of file formats including GenBank and GFF.

e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/

The GenBank files contain the features plus the sequence,
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk

Their GFF3 file only contains the features:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff

Some GFF files will include the sequence too, in this case we can
fetch it in FASTA format:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna

In principle, you could parse this FASTA file and the GFF3 file and
put together a GenBank file - or vice versa.

As an aside, I would also consider adding protein table support on the
same lines, look at this file:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt
The header information gives us the genome size, so Bio.SeqIO could
return a SeqRecord with lots of SeqFeature objects and for the
SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp.
 This is something I might look at implementing myself after Biopython
1.50 is out.  We should be able to read in a GenBank file and output a
PTT file, and verify it matches the NCBI provided version of the PTT
file.

Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
give me a SeqRecord with lots of SeqFeature objects.  If the sequence
is present in the file, it should use that (not the case for these
NCBI GFF3 files).  Otherwise, we wouldn't necessarily know the actual
sequence length which we'd need to use the new Bio.Seq.UnknownSeq
object.  However, we can infer from the maximum feature coordinates a
minimum sequence length.  For these NCBI GFF3 files, as there is a
source feature this does actually give use the genome length, so this
should work very nicely.

Peter