[Biopython-dev] Bio.GFF and Brad's code

Brad Chapman chapmanb at 50mail.com
Mon Apr 20 13:29:46 UTC 2009


Michiel;
Thanks for trying this out and your thoughts.

> > # It would be better to pass a handle to get_all_features
> > # instead of a file name. The file may be gzipped or bzipped,
> > # or the user may want to read it from the internet.

Yes, this is the way it was originally designed. I switched to file
names to be consistent with a distributed Disco implementation, which
needs to be fed a file instead of a handle. Your suggestion is a good
one. Let me give some thought to separating the two interfaces, as
handles would be more consistent with the rest of Biopython.
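
One way the two interfaces could coexist is a small wrapper that accepts
either a file name or an open handle. This is only a rough sketch: the
names open_or_passthrough and count_data_lines are made up for
illustration and are not part of the GFF parser or of Biopython.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_or_passthrough(source):
    """Yield a readable handle for a file name or an already open handle.

    Hypothetical helper, for illustration only.
    """
    if isinstance(source, str):
        # A file name: open it here and close it when the caller is done.
        handle = open(source)
        try:
            yield handle
        finally:
            handle.close()
    else:
        # Assume anything else is a file-like object owned by the caller.
        yield source


def count_data_lines(source):
    """Count the lines that are neither blank nor '#' comments."""
    with open_or_passthrough(source) as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("#"))


# The caller can pass an in-memory handle just as well as a path.
demo = io.StringIO("##gff-version 3\n1\tsrc\tgene\t10\t20\t.\t+\t.\tID=g1\n")
n_features = count_data_lines(demo)
```

With something like this, a gzipped file could be read by passing
gzip.open(name, "rt") from the standard library as the handle, while
code that must work from a file name can keep doing so.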

[accessing start and end]
> >>> print rec_dict['1'].features[0].location.start
> 20228
> >>> rec_dict['1'].features[0].location.start.position
> 20228
[...]
> Coupled with a variation of Brad's suggestion of adding start
> and end properties to the SeqFeature, if we make these act
> as proxies for feature.location.start and feature.location.end
> that would become just:
> 
> record = ...
> feature = record.features[5] #for example
> sub_seq = my_seq[feature.start:feature.end]

Thanks Peter, that's exactly right. Accessing the start and end
coordinates in SeqFeatures is unnecessarily cumbersome right now,
but can be fixed fairly simply. We should be able to get this in now
that 1.50 is rolled out. Eric's decorator way of doing this was very
nice.

> The fuzzy locations (from GenBank or EMBL files) would need
> a bit of care, ideally matching how the NCBI do things (easily
> checked by taking an NCBI GenBank file and comparing it to
> the simpler locations given in their FASTA, PTT or GFF files).

To be clear, start and end in SeqFeature would be plain integers and
would not handle any fuzziness. All of the representation is still there for
those actually dealing with fuzziness, but the top level attributes
would expose the coordinates nicely for the remaining 99% of cases.
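
As a rough sketch of the proxy idea, using toy stand-in classes rather
than the real Bio.SeqFeature code (ToyPosition, ToyLocation and
ToyFeature are hypothetical names, and using the built-in property
decorator is my assumption about how this might be written):

```python
class ToyPosition:
    """Stand-in for a position object holding an integer coordinate."""

    def __init__(self, position):
        self.position = position


class ToyLocation:
    """Stand-in for a location with start and end position objects."""

    def __init__(self, start, end):
        self.start = ToyPosition(start)
        self.end = ToyPosition(end)


class ToyFeature:
    """Stand-in for a feature exposing integer start and end."""

    def __init__(self, location):
        self.location = location

    @property
    def start(self):
        # Proxy for location.start, reduced to a plain integer.
        return int(self.location.start.position)

    @property
    def end(self):
        # Proxy for location.end, reduced to a plain integer.
        return int(self.location.end.position)


feature = ToyFeature(ToyLocation(3, 9))
my_seq = "ACGTACGTACGT"
sub_seq = my_seq[feature.start:feature.end]
```

The full location object stays reachable through feature.location, so
anyone who needs the fuzzy details loses nothing.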

> I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), 
> the SeqFeatures are way too complicated for my mind.
[...]
> For a basic parser, I like the _gff_line_map function much better. 
> Applied to the first line in the GFF file, it returns
[...]
> which is exactly what I need, in (almost) the places where I'd expect them.

Does solving the start/end problem as described above help bridge the
gap between SeqFeatures and the custom representation? Are there other
usability issues you found? I would prefer to expose one data structure
and think SeqFeature can handle the data well. They scale to nested
cases, and will be familiar to those using features in SeqIO or BioSQL.

Brad


