[Biojava-dev] Biojava.util package?

Scooter Willis HWillis at scripps.edu
Sun Apr 1 23:07:56 UTC 2012


It may need tweaking, as it was written to support a couple of different gene prediction packages. The parser will be fine, but dealing with context is subjective.

----- Reply message -----
From: "Andy Yates" <ayates at ebi.ac.uk>
To: "Scooter Willis" <HWillis at scripps.edu>
Cc: "P. Troshin" <to.petr at gmail.com>, "biojava-dev" <biojava-dev at lists.open-bio.org>
Subject: [Biojava-dev] Biojava.util package?
Date: Sun, Apr 1, 2012 5:05 pm



Hi Scooter,

Excellent news. I just wanted to make sure that the right specifications were being used :)

Andy

On 1 Apr 2012, at 22:06, Scooter Willis wrote:

> Andy
>
> In the genome package I have parsers for GTF, GFF and GFF3, and a writer for GFF3.
>
> Scooter
>
> ----- Reply message -----
> From: "Andy Yates" <ayates at ebi.ac.uk>
> To: "P. Troshin" <to.petr at gmail.com>
> Cc: "biojava-dev" <biojava-dev at lists.open-bio.org>
> Subject: [Biojava-dev] Biojava.util package?
> Date: Sun, Apr 1, 2012 12:42 pm
>
>
>
> Hi
>
> This is the latest spec for GFF3
>
> http://www.sequenceontology.org/gff3.shtml
>
> All the best,
>
> Andy
>
> Sent from my mobile.
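For orientation, a GFF3 feature line as described in the specification linked above has nine tab-separated columns, the ninth holding `key=value` attributes separated by semicolons. What follows is only a minimal sketch of splitting one such line. The class name `Gff3Line` is hypothetical and is not the parser from the genome package mentioned above; the sketch skips the score and phase columns and ignores directives, comments, URL-escaping and multi-valued attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for one GFF3 feature line; not a BioJava class. */
final class Gff3Line {
    final String seqId;
    final String source;
    final String type;
    final int start;               // 1-based, inclusive
    final int end;                 // 1-based, inclusive
    final char strand;             // '+', '-', '.' or '?'
    final Map<String, String> attributes = new LinkedHashMap<>();

    private Gff3Line(String[] c) {
        seqId = c[0];
        source = c[1];
        type = c[2];
        start = Integer.parseInt(c[3]);
        end = Integer.parseInt(c[4]);
        strand = c[6].charAt(0);
        // Column 9: semicolon-separated key=value pairs; "." means none.
        if (!c[8].equals(".")) {
            for (String pair : c[8].split(";")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    attributes.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
    }

    /** Parses one feature line; rejects lines without exactly nine columns. */
    static Gff3Line parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("Expected 9 columns, got " + cols.length);
        }
        return new Gff3Line(cols);
    }
}
```

Keeping the attributes in insertion order makes it easier to write a line back out unchanged, which matters if the same type is shared by a reader and a writer.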
>
> On 1 Apr 2012, at 18:03, "P. Troshin" <to.petr at gmail.com> wrote:
>
> >>> Also, what other parsers are you going to write?
> >> I've been looking into the GenBank, Stockholm, CATH, and UniProt XML
> >> formats, which are mentioned here:
> >> http://biojava.org/wiki/BioJava3_Feature_Request
> >
> > These are good suggestions. Also could you have a look at more
> > multiple sequence alignment formats, e.g. PIR, PFAM, Stockholm, MSF,
> > Clustal? Sequence features parser like GFF
> > (http://www.sanger.ac.uk/resources/software/gff/spec.html) might be
> > useful too. Phylogeny parsers e.g. Newick tree file parser etc. As for
> > the Genbank parser, I think we should focus on the XML version of the
> > Genbank file as this is now widely available and use standard Java XML
> > readers for the implementation.
> >
> > Regards,
> > Peter
> >
> >>> Now you need to look at the parsers in BioJava and have an idea of
> >>> how you are going to unify them.
> >> This is what I was trying to figure out earlier. After looking at
> >> BioPython, I think it might be effective to read files into a common
> >> sequence class (BioPython uses SeqRecord), and then provide utilities
> >> to convert from this sequence to others like DNASequence, RNASequence,
> >> and ProteinSequence. This could avoid some of the complexities of the
> >> Abstract Factory and Builder patterns that are sometimes used in
> >> situations like this. Additionally, it shouldn't be too hard to unify
> >> the current parsers under this system. FastaReader and FastaWriter
> >> already have interfaces that make it easy to extend its functionality,
> >> so they won't be a problem. FastqReader already does something like
> >> what I'm proposing, so it shouldn't be too difficult to adapt either.
> >> The others seem to be somewhat different, so I'll have to examine them
> >> more closely.
> >>
> >> David
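The idea above, reading every format into one common record and converting afterwards, might look roughly like the sketch below. Everything in it is hypothetical: `SeqRecord` only borrows its name from BioPython, `Alphabet` and `requireAlphabet` are made up for illustration, and a real implementation would presumably hand the validated text to BioJava's own DNASequence, RNASequence or ProteinSequence classes instead of returning a string.

```java
import java.util.Locale;

/** Hypothetical alphabets, used only to validate residues in this sketch. */
enum Alphabet {
    DNA("ACGTN"),
    RNA("ACGUN"),
    PROTEIN("ACDEFGHIKLMNPQRSTVWYX*");

    private final String symbols;

    Alphabet(String symbols) {
        this.symbols = symbols;
    }

    /** True if every residue is a symbol of this alphabet. */
    boolean accepts(String residues) {
        for (int i = 0; i < residues.length(); i++) {
            if (symbols.indexOf(residues.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}

/** Format-neutral record that every parser would produce (hypothetical). */
final class SeqRecord {
    final String id;
    final String description;
    final String residues;

    SeqRecord(String id, String description, String residues) {
        this.id = id;
        this.description = description;
        this.residues = residues.toUpperCase(Locale.ROOT);
    }

    /** Conversion step: check the residues against the requested alphabet. */
    String requireAlphabet(Alphabet target) {
        if (!target.accepts(residues)) {
            throw new IllegalArgumentException(id + " is not valid " + target);
        }
        return residues;
    }
}
```

One design point this makes visible: a string such as ACGT is valid in both the DNA and the protein alphabet, so the caller still has to name the target type; the common record only postpones that decision until after parsing.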
> >>
> >> On Sat, Mar 31, 2012 at 7:38 PM, P. Troshin <to.petr at gmail.com> wrote:
> >>>>>> Does this look like a fair list?
> >>>
> >>> Yes your important feature list makes a lot of sense, though I do not
> >>> think any of the other features do (yes, your code needs to have
> >>> sensible defaults, but also custom function for more specific cases).
> >>> Now you need to look at the parsers in BioJava and have an idea of how
> >>> you are going to unify them.
> >>> Also, what other parsers are you going to write?
> >>> We are slipping into implementation here, but Java is an OO language:
> >>> you can store a FASTA sequence in a Map, but it is not going to
> >>> be as flexible as a custom object.
> >>> It may be a semantic difference but it is an important one; it is the
> >>> difference between a good API and a bad API, between easy-to-use and
> >>> not-so-easy-to-use code. David, how much experience do you have with Java?
> >>>
> >>> Regards,
> >>> Peter
> >>>
> >>>
> >>> On 31 March 2012 18:16, David Felty <davfelty at gmail.com> wrote:
> >>>> I've been looking at the file parsers for BioPython and BioPerl, and
> >>>> here are some features I've compiled:
> >>>> Important features:
> >>>> - Conversion between file formats
> >>>> - Lazy IO; useful for large files
> >>>> - Use Iterable interface so we get Java foreach over sequences
> >>>> - Index sequences by ID (turn a list of sequences to a map from ID -> seq)
> >>>> - Fetching from remote databases
> >>>>
> >>>> Other features:
> >>>> - Restrict fields needed to speed up parsing; see
> >>>> http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
> >>>> - Auto-detect file format (use file extension)
> >>>> - General-purpose API with sensible defaults for most cases, and a
> >>>> more specific but complex API for more control
> >>>> - Index sequences by a user-defined value
> >>>> - Store indexed database files locally (BioPython stores as a SQLite database)
> >>>>
> >>>> Does this look like a fair list? I tried to look for common use cases
> >>>> in BioJava's tutorial, but I only found this page, which comes from
> >>>> BioJava 1.8: http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
> >>>> Are there any other useful sources I could look at? Or perhaps even
> >>>> some real-world code that makes use of parsers?
> >>>>
> >>>> Thanks,
> >>>> David
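Three of the "important features" listed above (lazy IO, the Iterable interface, and indexing by ID) fit together naturally, and the sketch below shows one way they could. It is an illustration only: `FastaEntry`, `LazyFasta` and `indexById` are hypothetical names, not BioJava classes, and the reader handles only plain single-line headers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical record type: one FASTA entry. */
final class FastaEntry {
    final String id;
    final String sequence;

    FastaEntry(String id, String sequence) {
        this.id = id;
        this.sequence = sequence;
    }
}

/** Hypothetical lazy reader: entries are parsed only as the loop asks for them. */
final class LazyFasta implements Iterable<FastaEntry> {
    private final BufferedReader in;

    LazyFasta(BufferedReader in) {
        this.in = in;
    }

    @Override
    public Iterator<FastaEntry> iterator() {
        return new Iterator<FastaEntry>() {
            private String header = readLine();   // pending '>' line, or null at EOF

            { while (header != null && !header.startsWith(">")) header = readLine(); }

            @Override
            public boolean hasNext() {
                return header != null;
            }

            @Override
            public FastaEntry next() {
                if (header == null) throw new NoSuchElementException();
                String id = header.substring(1).trim().split("\\s+")[0];
                StringBuilder seq = new StringBuilder();
                String line;
                // Read only up to the next header, so one entry is held at a time.
                while ((line = readLine()) != null && !line.startsWith(">")) {
                    seq.append(line.trim());
                }
                header = line;
                return new FastaEntry(id, seq.toString());
            }
        };
    }

    private String readLine() {
        try {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Eager indexing built on the lazy reader: a map from ID to entry. */
    static Map<String, FastaEntry> indexById(Iterable<FastaEntry> entries) {
        Map<String, FastaEntry> index = new LinkedHashMap<>();
        for (FastaEntry e : entries) index.put(e.id, e);
        return index;
    }
}
```

Because the entries come straight off the underlying reader, this Iterable can only be walked once; a version that supports repeated iteration would need to reopen the source each time `iterator()` is called.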
> >>>>
> >>>> On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
> >>>>>> But are there any additional features anyone wants me to consider? Once again, I don't
> >>>>>> have the same experience as many of you, so your input is very
> >>>>>> helpful!
> >>>>>
> >>>>> David, I think this is a pretty good list. Remember that you are getting
> >>>>> into something more than just a FASTA parser.
> >>>>>
> >>>>>> But here is what I've gathered so far from a combination
> >>>>>> of already-existing code and people's responses:
> >>>>>
> >>>>> I think this is a very good approach.
> >>>>> Look at the existing parsers in BioJava and beyond; the features that
> >>>>> are common will be the most important. Less common ones will be useful
> >>>>> in some cases but not in others. Come up with a set of use cases and try
> >>>>> using the parsers to achieve them; see how easy (or indeed possible)
> >>>>> it is going to be with various parsers. I appreciate this is a lot of
> >>>>> work, but this way you'll know by heart what a good parser consists
> >>>>> of.
> >>>>> You can learn from many implementations to get your own just right.
> >>>>> Once you've done this, you are going to be the expert and will be able
> >>>>> to come up with a list of features in order of importance that your
> >>>>> parser is going to have and have some guesstimate of how long it is
> >>>>> going to take you to implement them. Do not hesitate to ask the
> >>>>> community if there is something you cannot get your head around.
> >>>>>
> >>>>> Good luck,
> >>>>> Peter
> >>>>>
> >>>>>
> >>>>> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
> >>>>>>> I'd suggest you step aside from the
> >>>>>>> details of implementation. Think about what features your parser(s)
> >>>>>>> must have and then how you are going to achieve them
> >>>>>>
> >>>>>> Thank you for this! I now realize that I've been concentrating too
> >>>>>> much on the implementation rather than the features. The
> >>>>>> implementation will be important when (or if) I actually work on the
> >>>>>> project during GSoC, but for now, I'll try to focus on features for my
> >>>>>> proposal.
> >>>>>>
> >>>>>> Unfortunately, I'm not very acquainted with the world of computational
> >>>>>> biology, so I can't be sure what features would be most useful for the
> >>>>>> file parsers. But here is what I've gathered so far from a combination
> >>>>>> of already-existing code and people's responses:
> >>>>>> - Simple api
> >>>>>> - Robust
> >>>>>> - Extensible
> >>>>>> - Good performance
> >>>>>> - Feature-rich
> >>>>>> - Wide variety of parsers
> >>>>>> - Proxy-fetching from remote databases (by ID or location)
> >>>>>> - Local caching
> >>>>>> - Auto-detection of data type
> >>>>>> - Auto-detection of file format
> >>>>>> - Lazy IO
> >>>>>> - Random access file reading
> >>>>>>
> >>>>>> Obviously, these are not all of equal importance, so I'll have to pick
> >>>>>> out the most important ones for my proposal. But are there any
> >>>>>> additional features anyone wants me to consider? Once again, I don't
> >>>>>> have the same experience as many of you, so your input is very
> >>>>>> helpful!
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>>
> >>>>>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com> wrote:
> >>>>>>> Hi David,
> >>>>>>>
> >>>>>>> Great to see such a discussion! You should see how important your work
> >>>>>>> is going to be for the Bio community.
> >>>>>>>
> >>>>>>> Now, what you need to do is to try taking into account what other
> >>>>>>> people were suggesting and put it into your proposal. It's not going
> >>>>>>> to be any good just to add a bunch of opinions; you need to come up
> >>>>>>> with a coherent proposal. For this I'd suggest you step aside from the
> >>>>>>> details of implementation. Think about what features your parser(s)
> >>>>>>> must have and then how you are going to achieve them.
> >>>>>>> I'd suggest that your parsers should be
> >>>>>>> - easy to use (IMHO this is something BioJava 1 FASTA parser lacked)
> >>>>>>> - robust
> >>>>>>> - extensible
> >>>>>>> - have good performance
> >>>>>>> - most importantly, have sufficiently rich feature set so that we can
> >>>>>>> replace other parsers (for the same format) in BioJava with yours.
> >>>>>>>
> >>>>>>> Do not forget to split your work in several achievable stages.
> >>>>>>>
> >>>>>>> I'd be careful about transferring the design from Python and
> >>>>>>> especially a decade-old Perl implementation straight to Java. While
> >>>>>>> high-level concepts may be similar, implementation details should
> >>>>>>> not be. It’s not that there is anything wrong with these parsers, it's
> >>>>>>> just that the languages are different. It is good to know how things
> >>>>>>> are done elsewhere, but I'd suggest that for a Java implementation you
> >>>>>>> should be taking inspiration from some well-known Java features. For
> >>>>>>> example, the Java Collections: a set of highly regarded tools for
> >>>>>>> working with various collections of objects. Also do some reading on
> >>>>>>> Java enums; your proposed implementation will definitely benefit from
> >>>>>>> using them.
> >>>>>>>
> >>>>>>> Have fun,
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Peter
> >>>>>>>
> >>>>>>>
> >>>>>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
> >>>>>>>> Hey Andreas,
> >>>>>>>>
> >>>>>>>> It wouldn't be too difficult to make a method that can infer the
> >>>>>>>> file type using the file extension. In fact, it looks like BioPerl's
> >>>>>>>> SeqIO does something like this. On the other hand, BioPython's SeqIO
> >>>>>>>> takes the route of "explicit is better than implicit," and requires
> >>>>>>>> that you explicitly give the format. Perhaps BioJava could take both
> >>>>>>>> routes, and have an overloaded parse method that infers the file type,
> >>>>>>>> along with the regular explicit method.
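The "both routes" suggestion above could be sketched as an enum of formats plus a pair of overloads, one explicit and one that infers the format from the file extension. All names here (`SeqFormat`, `fromFileName`, `SeqIOSketch`, `describe`) are hypothetical, the extension lists are just common conventions rather than anything BioJava defines, and `describe` merely stands in for a real `parse` method so that the sketch stays self-contained.

```java
import java.util.Locale;

/** Hypothetical format enum; the extensions listed are common conventions only. */
enum SeqFormat {
    FASTA("fasta", "fa", "fna", "faa"),
    FASTQ("fastq", "fq"),
    GENBANK("gb", "gbk");

    private final String[] extensions;

    SeqFormat(String... extensions) {
        this.extensions = extensions;
    }

    /** Implicit route: guess the format from the file extension. */
    static SeqFormat fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (SeqFormat format : values()) {
            for (String candidate : format.extensions) {
                if (candidate.equals(ext)) {
                    return format;
                }
            }
        }
        throw new IllegalArgumentException("Cannot infer format of " + fileName);
    }
}

/** Hypothetical facade showing the two overloads side by side. */
final class SeqIOSketch {
    /** Explicit route: the caller names the format. */
    static String describe(String fileName, SeqFormat format) {
        return fileName + " -> " + format;
    }

    /** Overload that infers the format, then delegates to the explicit one. */
    static String describe(String fileName) {
        return describe(fileName, SeqFormat.fromFileName(fileName));
    }
}
```

Failing loudly on an unknown extension keeps the implicit route honest: a caller with an unusual file name (or a compressed file such as one ending in .gz) is pushed towards the explicit overload rather than getting a silent wrong guess.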
> >>>>>>>>
> >>>>>>>> As for non-FASTA files, I implemented a couple of FASTQ parsers here:
> >>>>>>>> http://pastebin.com/KLcpq8Qb
> >>>>>>>> This would work similarly:
> >>>>>>>>
> >>>>>>>> InputStream is = ...
> >>>>>>>> Iterable<ProteinSequence> seqs = SeqIO.parse(is, SeqIO.FASTQ_SANGER, SeqIO.PROTEIN);
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It looks like the other sequence readers aren't as clear-cut, so they
> >>>>>>>> may need a bit more wrapping before they can be adapted to this
> >>>>>>>> method. A common problem is that sequence readers don't return a
> >>>>>>>> specific type of sequence, like with
> >>>>>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
> >>>>>>>> just contains the sequence data in itself. We might want to create
> >>>>>>>> methods that convert the UniprotProxySequenceReader into sequences
> >>>>>>>> that make more sense, like DNASequence and ProteinSequence.
> >>>>>>>>
> >>>>>>>> I'll look into this more later, I have to go to class.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi David,
> >>>>>>>>>
> >>>>>>>>> so far it still feels like a wrapper for what is already there. Try to
> >>>>>>>>> take it to the next level. Why does the user still need to provide the
> >>>>>>>>> type of file, can't this be auto-detected? What is the behaviour for
> >>>>>>>>> non-fasta files, what can be supported and where are the limits, etc.
> >>>>>>>>>
> >>>>>>>>> Andreas
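On the auto-detection question raised above: when only a stream is available and there is no file name to inspect, one option is to peek at the first bytes and rewind. The sketch below relies on the facts that FASTA records start with '>' and FASTQ records with '@'; the class `FormatSniffer`, the method `sniff` and the string labels it returns are hypothetical, and a single leading character is of course a weak signal that a real implementation would want to back up with further checks.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

/** Hypothetical content sniffer; returns a plain label, not a real format type. */
final class FormatSniffer {
    private static final int LOOKAHEAD = 1024;

    /**
     * Peeks at the first non-whitespace byte within LOOKAHEAD bytes,
     * then rewinds so the parser sees the stream from the start.
     */
    static String sniff(BufferedInputStream in) throws IOException {
        in.mark(LOOKAHEAD);
        try {
            for (int i = 0; i < LOOKAHEAD; i++) {
                int b = in.read();
                if (b < 0) {
                    break;                       // end of stream
                }
                if (Character.isWhitespace(b)) {
                    continue;                    // skip leading blank lines
                }
                if (b == '>') {
                    return "FASTA";
                }
                if (b == '@') {
                    return "FASTQ";
                }
                break;                           // anything else: give up
            }
            return "UNKNOWN";
        } finally {
            in.reset();                          // rewind to the marked position
        }
    }
}
```

Returning an "unknown" result rather than guessing leaves room for the explicit route: the caller can still name the format when sniffing is inconclusive.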
> >>>>>>>>>
> >>>>>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com> wrote:
> >>>>>>>>>> I've actually been working on something like this for my GSoC proposal,
> >>>>>>>>>> here's what I came up with:
> >>>>>>>>>>
> >>>>>>>>>> public class SeqIO {
> >>>>>>>>>>    public static final int FASTA = 0;
> >>>>>>>>>>    public static final int FASTQ = 1;
> >>>>>>>>>>    public static final Class<DNASequence> DNA = DNASequence.class;
> >>>>>>>>>>    public static final Class<ProteinSequence> PROTEIN = ProteinSequence.class;
> >>>>>>>>>>
> >>>>>>>>>>    @SuppressWarnings("unchecked")
> >>>>>>>>>>    public static <S extends Sequence> Iterable<S> parse(InputStream is,
> >>>>>>>>>>            int fileFormat, Class<S> seqType) throws Exception {
> >>>>>>>>>>        switch (fileFormat) {
> >>>>>>>>>>            case FASTA:
> >>>>>>>>>>                if (seqType == DNA) {
> >>>>>>>>>>                    // the helper returns a map keyed by ID, so iterate over its values
> >>>>>>>>>>                    return (Iterable<S>) FastaReaderHelper.readFastaDNASequence(is).values();
> >>>>>>>>>>                } else if (seqType == PROTEIN) {
> >>>>>>>>>>                    // etc...
> >>>>>>>>>>                }
> >>>>>>>>>>                break;
> >>>>>>>>>>            case FASTQ:
> >>>>>>>>>>                // etc...
> >>>>>>>>>>        }
> >>>>>>>>>>        // reached only for combinations not handled above
> >>>>>>>>>>        throw new IllegalArgumentException("Unsupported format or sequence type");
> >>>>>>>>>>    }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> It would be used like so:
> >>>>>>>>>>
> >>>>>>>>>> InputStream is = ...
> >>>>>>>>>> Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA, SeqIO.DNA);
> >>>>>>>>>> for (DNASequence s : seqs) {
> >>>>>>>>>>   // do something
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Obviously it's not the prettiest and a lot could be changed, but that's my
> >>>>>>>>>> initial design. I tried to base it off BioPython's SeqIO, but static typing
> >>>>>>>>>> and the variety of Sequence types forced me to put in some nasty generics.
> >>>>>>>>>> Any tips would be appreciated!
> >>>>>>>>>>
> >>>>>>>>>> David
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
> >>>>>>>>>> biojava at hannes.oib.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Yes, something like a simplifying and unifying wrapper would be what I
> >>>>>>>>>>> am thinking of.
> >>>>>>>>>>>
> >>>>>>>>>>> Hannes
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>>>>>>>>>>> Hi Hannes,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I guess this is pretty similar to:
> >>>>>>>>>>>>
> >>>>>>>>>>>> http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
> >>>>>>>>>>>>
> >>>>>>>>>>>> we have also been using "proxy" objects to fetch sequence data on the fly
> >>>>>>>>>>>>
> >>>>>>>>>>>> http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
> >>>>>>>>>>>>
> >>>>>>>>>>>> As such I think we should discuss this a bit more. If we can find a
> >>>>>>>>>>>> common api that is simple and works with both local files as well as
> >>>>>>>>>>>> remote proxy objects, that would be nice. There should be no need to
> >>>>>>>>>>>> change much of the existing code, but perhaps there could be a
> >>>>>>>>>>>> simplified wrapper for what is already there.
> >>>>>>>>>>>>
> >>>>>>>>>>>>  Andreas
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
> >>>>>>>>>>>> <biojava at hannes.oib.com> wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I browsed around in the sister projects Biopython and Bioperl a bit,
> >>>>>>>>>>>>> and noticed that much of the user interaction with the code goes
> >>>>>>>>>>>>> through classes like SeqIO, SearchIO, AlignIO...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So that got me thinking: how about we create similar "Interface"
> >>>>>>>>>>>>> classes in Biojava?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> PROS:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>  - easy change for programmers who switch languages
> >>>>>>>>>>>>>  - easy base interface that can be used in cookbook examples
> >>>>>>>>>>>>>  - makes code more readable if designed properly
> >>>>>>>>>>>>>  - easy access to features that are spread over the whole codebase but
> >>>>>>>>>>>>> are connected anyway, like all file parsers
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> CONS:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>  - another thing to maintain
> >>>>>>>>>>>>>  - creates possible cross-dependencies (but if you don't want that,
> >>>>>>>>>>>>> just use the existing classes directly)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What are your thoughts?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> python from http://biopython.org/wiki/SeqIO:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> from Bio import SeqIO
> >>>>>>>>>>>>> handle = open("example.fasta", "rU")
> >>>>>>>>>>>>> for record in SeqIO.parse(handle, "fasta") :
> >>>>>>>>>>>>>    print record.id
> >>>>>>>>>>>>> handle.close()
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> possible equivalent in biojava (support for streaming API, Iterators, etc?):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> import org.biojava3.util.SeqIO;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> File file = new File("example.fasta");
> >>>>>>>>>>>>> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
> >>>>>>>>>>>>> while (seqIO.hasNext()) {
> >>>>>>>>>>>>>    System.out.println(seqIO.next());
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> seqIO.close();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hannes
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> biojava-dev mailing list
> >>>>>>>>>>>>> biojava-dev at lists.open-bio.org
> >>>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> -----------------------------------------------------------------------
> >>>>>>>>>>>> Dr. Andreas Prlic
> >>>>>>>>>>>> Senior Scientist, RCSB PDB Protein Data Bank
> >>>>>>>>>>>> University of California, San Diego
> >>>>>>>>>>>> (+1) 858.246.0526
> >>>>>>>>>>>> -----------------------------------------------------------------------
> >>>>>>>>>>>




