[Biojava-dev] Biojava.util package?

Sun Apr 1 21:15:43 UTC 2012

Great! Fewer parsers to write.

Regards,
Peter

On 1 April 2012 22:06, Scooter Willis <HWillis at scripps.edu> wrote:
> Andy
>
> In the genome package I have parsers for GTF, GFF and GFF3 and a writer for
> GFF3.
>
> Scooter
>
>
> ----- Reply message -----
> From: "Andy Yates" <ayates at ebi.ac.uk>
> To: "P. Troshin" <to.petr at gmail.com>
> Cc: "biojava-dev" <biojava-dev at lists.open-bio.org>
> Subject: [Biojava-dev] Biojava.util package?
> Date: Sun, Apr 1, 2012 12:42 pm
>
>
>
> Hi
>
> This is the latest spec for GFF3
>
> http://www.sequenceontology.org/gff3.shtml
>
> All the best,
>
> Andy
>
> Sent from my mobile.
>
> On 1 Apr 2012, at 18:03, "P. Troshin" <to.petr at gmail.com> wrote:
>
>>>> Also, what other parsers are you going to write?
>>> I've been looking into the GenBank, Stockholm, CATH, and UniProt XML
>>> formats, which are mentioned here:
>>> http://biojava.org/wiki/BioJava3_Feature_Request
>>
>> These are good suggestions. Could you also have a look at more
>> multiple sequence alignment formats, e.g. PIR, PFAM, Stockholm, MSF,
>> and Clustal? A sequence feature parser such as GFF
>> (http://www.sanger.ac.uk/resources/software/gff/spec.html) might be
>> useful too, as would phylogeny parsers, e.g. a Newick tree file
>> parser. As for the GenBank parser, I think we should focus on the XML
>> version of the GenBank file, as this is now widely available, and use
>> the standard Java XML readers for the implementation.
>>
>> Regards,
>> Peter
>>
>>>> Now you need to look at the parsers in BioJava and have an idea of
>>>> how you are going to unify them.
>>> This is what I was trying to figure out earlier. After looking at
>>> BioPython, I think it might be effective to read files into a common
>>> sequence class (BioPython uses SeqRecord), and then provide utilities
>>> to convert from this sequence to others like DNASequence, RNASequence,
>>> and ProteinSequence. This could avoid some of the complexities of the
>>> Abstract Factory and Builder patterns that are sometimes used in
>>> situations like this. Additionally, it shouldn't be too hard to unify
>>> the current parsers under this system. FastaReader and FastaWriter
>>> already have interfaces that make it easy to extend their functionality,
>>> so they won't be a problem. FastqReader already does something like
>>> what I'm proposing, so it shouldn't be too difficult to adapt either.
>>> The others seem to be somewhat different, so I'll have to examine them
>>> more closely.
>>>
>>> David
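
To make the common-sequence-class idea above concrete, here is a minimal sketch. It is an illustration under stated assumptions, not BioJava code: SeqRecord, DnaSeq and ProteinSeq are hypothetical names, with the latter two standing in for DNASequence and ProteinSequence so that the example compiles on its own.

```java
import java.util.Locale;

/** Stand-in for BioJava's DNASequence, so the sketch compiles on its own. */
final class DnaSeq {
    final String id;
    final String bases;

    DnaSeq(String id, String bases) {
        this.id = id;
        this.bases = bases;
    }
}

/** Stand-in for BioJava's ProteinSequence. */
final class ProteinSeq {
    final String id;
    final String residues;

    ProteinSeq(String id, String residues) {
        this.id = id;
        this.residues = residues;
    }
}

/** Hypothetical format-neutral record that every parser would produce. */
final class SeqRecord {
    private final String id;
    private final String residues;

    SeqRecord(String id, String residues) {
        this.id = id;
        this.residues = residues.toUpperCase(Locale.ROOT);
    }

    String getId() {
        return id;
    }

    /** Converts to DNA, rejecting residues outside A, C, G, T and N. */
    DnaSeq toDna() {
        if (!residues.matches("[ACGTN]*")) {
            throw new IllegalArgumentException("Not a DNA sequence: " + id);
        }
        return new DnaSeq(id, residues);
    }

    /** Converts to protein; no validation, to keep the sketch short. */
    ProteinSeq toProtein() {
        return new ProteinSeq(id, residues);
    }
}
```

With something like this, each parser only has to produce SeqRecord objects, and the caller decides the concrete type afterwards, e.g. `DnaSeq dna = record.toDna();`.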
>>>
>>> On Sat, Mar 31, 2012 at 7:38 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>>>> Does this look like a fair list?
>>>>
>>>> Yes, your important feature list makes a lot of sense, though I do not
>>>> think any of the other features do (yes, your code needs to have
>>>> sensible defaults, but also custom functions for more specific cases).
>>>> Now you need to look at the parsers in BioJava and have an idea of how
>>>> you are going to unify them.
>>>> Also, what other parsers are you going to write?
>>>> We are slipping into implementation here, but Java is an OO language:
>>>> you can store a FASTA sequence in a Map, but it is not going to be as
>>>> flexible as a custom object.
>>>> It may be a semantic difference, but it is an important one: it is the
>>>> difference between a good API and a bad API, between code that is easy
>>>> to use and code that is not. David, how much experience do you have
>>>> with Java?
>>>>
>>>> Regards,
>>>> Peter
>>>>
>>>>
>>>> On 31 March 2012 18:16, David Felty <davfelty at gmail.com> wrote:
>>>>> I've been looking at the file parsers for BioPython and BioPerl, and
>>>>> here are some features I've compiled:
>>>>> Important features:
>>>>> - Conversion between file formats
>>>>> - Lazy IO; useful for large files
>>>>> - Use Iterable interface so we get Java foreach over sequences
>>>>> - Index sequences by ID (turn a list of sequences to a map from ID ->
>>>>> seq)
>>>>> - Fetching from remote databases
>>>>>
>>>>> Other features:
>>>>> - Restrict fields needed to speed up parsing; see
>>>>> http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
>>>>> - Auto-detect file format (use file extension)
>>>>> - General-purpose API with sensible defaults for most cases, and a
>>>>> more specific but complex API for more control
>>>>> - Index sequences by a user-defined value
>>>>> - Store indexed database files locally (BioPython stores as a SQLite
>>>>> database)
>>>>>
>>>>> Does this look like a fair list? I tried to look for common use cases
>>>>> in BioJava's tutorial, but I only found this page, which comes from
>>>>> BioJava 1.8:
>>>>> http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
>>>>> Are there any other useful sources I could look at? Or perhaps even
>>>>> some real-world code that makes use of parsers?
>>>>>
>>>>> Thanks,
>>>>> David
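
The "Lazy IO", "Iterable" and "index by ID" items in the list above could fit together roughly as follows. This is only a sketch under assumptions: LazyFasta and indexById are hypothetical names, records are plain {id, sequence} string pairs rather than BioJava sequence objects, and the FASTA handling is deliberately naive.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical lazy FASTA reader: a record is parsed only when requested. */
final class LazyFasta implements Iterable<String[]> {
    private final BufferedReader in;

    LazyFasta(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Single-pass iterator over {id, sequence} pairs. */
    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            // Header of the next unread record, or null once input is exhausted.
            private String pendingHeader = readLine();

            {
                // Skip anything that precedes the first '>' line.
                while (pendingHeader != null && !pendingHeader.startsWith(">")) {
                    pendingHeader = readLine();
                }
            }

            @Override
            public boolean hasNext() {
                return pendingHeader != null;
            }

            @Override
            public String[] next() {
                if (pendingHeader == null) {
                    throw new NoSuchElementException();
                }
                // The ID is the first whitespace-delimited token after '>'.
                String id = pendingHeader.substring(1).trim().split("\\s+")[0];
                StringBuilder residues = new StringBuilder();
                String line;
                while ((line = readLine()) != null && !line.startsWith(">")) {
                    residues.append(line.trim());
                }
                pendingHeader = line; // next header, or null at end of input
                return new String[] {id, residues.toString()};
            }
        };
    }

    private String readLine() {
        try {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains the records into a map from ID to sequence, keeping file order. */
    static Map<String, String> indexById(Iterable<String[]> records) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String[] record : records) {
            index.put(record[0], record[1]);
        }
        return index;
    }
}
```

Because LazyFasta implements Iterable, `for (String[] rec : new LazyFasta(reader)) { ... }` works directly and never holds more than one record in memory; indexById trades that laziness for random access by ID.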
>>>>>
>>>>> On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>>>> But are there any additional features anyone wants me to consider?
>>>>>>> Once again, I don't
>>>>>>> have the same experience as many of you, so your input is very
>>>>>>> helpful!
>>>>>>
>>>>>> David, I think this is a pretty good list. Remember that you are
>>>>>> working on something more than just a FASTA parser.
>>>>>>
>>>>>>> But here is what I've gathered so far from a combination
>>>>>>> of already-existing code and people's responses:
>>>>>>
>>>>>> I think this is a very good approach.
>>>>>> Look at the existing parsers in BioJava and beyond; the features that
>>>>>> are common will be the most important. Less common ones will be useful
>>>>>> in some cases but less so in others. Come up with a set of use cases
>>>>>> and try using the parsers to achieve them; see how easy (or indeed
>>>>>> possible) it is going to be with various parsers. I appreciate this is
>>>>>> a lot of work, but this way you'll know by heart what a good parser
>>>>>> consists of.
>>>>>> You can learn from many implementations to get your own just right.
>>>>>> Once you've done this, you are going to be the expert and will be able
>>>>>> to come up with a list of features in order of importance that your
>>>>>> parser is going to have, and have some guesstimate of how long it is
>>>>>> going to take you to implement them. Do not hesitate to ask the
>>>>>> community if there is something you cannot get your head around.
>>>>>>
>>>>>> Good luck,
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
>>>>>>>> I'd suggest you step aside from the
>>>>>>>> details of implementation. Think about what features your parser(s)
>>>>>>>> must have and then how you are going to achieve them
>>>>>>>
>>>>>>> Thank you for this! I now realize that I've been concentrating too
>>>>>>> much on the implementation rather than the features. The
>>>>>>> implementation will be important when (or if) I actually work on the
>>>>>>> project during GSoC, but for now, I'll try to focus on features for
>>>>>>> my proposal.
>>>>>>>
>>>>>>> Unfortunately, I'm not very acquainted with the world of
>>>>>>> computational biology, so I can't be sure what features would be most
>>>>>>> useful for the file parsers. But here is what I've gathered so far
>>>>>>> from a combination of already-existing code and people's responses:
>>>>>>> - Simple API
>>>>>>> - Robust
>>>>>>> - Extensible
>>>>>>> - Good performance
>>>>>>> - Feature-rich
>>>>>>> - Wide variety of parsers
>>>>>>> - Proxy-fetching from remote databases (by ID or location)
>>>>>>> - Local caching
>>>>>>> - Auto-detection of data type
>>>>>>> - Auto-detection of file format
>>>>>>> - Lazy IO
>>>>>>> - Random access file reading
>>>>>>>
>>>>>>> Obviously, these are not all of equal importance, so I'll have to
>>>>>>> pick out the most important ones for my proposal. But are there any
>>>>>>> additional features anyone wants me to consider? Once again, I don't
>>>>>>> have the same experience as many of you, so your input is very
>>>>>>> helpful!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com>
>>>>>>> wrote:
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> Great to see such a discussion! You should see how important your
>>>>>>>> work is going to be for the Bio community.
>>>>>>>>
>>>>>>>> Now, what you need to do is take into account what other people
>>>>>>>> were suggesting and put it into your proposal. It's not going to be
>>>>>>>> any good just to add a bunch of opinions; you need to come up with
>>>>>>>> a coherent proposal. For this I'd suggest you step aside from the
>>>>>>>> details of implementation. Think about what features your parser(s)
>>>>>>>> must have and then how you are going to achieve them.
>>>>>>>> I'd suggest that your parsers should
>>>>>>>> - be easy to use (IMHO this is something the BioJava 1 FASTA parser
>>>>>>>>   lacked)
>>>>>>>> - be robust
>>>>>>>> - be extensible
>>>>>>>> - have good performance
>>>>>>>> - most importantly, have a sufficiently rich feature set so that we
>>>>>>>>   can replace other parsers (for the same format) in BioJava with
>>>>>>>>   yours.
>>>>>>>>
>>>>>>>> Do not forget to split your work into several achievable stages.
>>>>>>>>
>>>>>>>> I'd be careful about transferring the design from Python and
>>>>>>>> especially a decade-old Perl implementation straight to Java. While
>>>>>>>> the high-level concepts may be similar, the implementation details
>>>>>>>> should not be. It's not that there is anything wrong with these
>>>>>>>> parsers; it's just that the languages are different. It is good to
>>>>>>>> know how things are done elsewhere, but I'd suggest that for a Java
>>>>>>>> implementation you should take inspiration from some well-known Java
>>>>>>>> features, for example the Java Collections, a set of highly regarded
>>>>>>>> tools for working with various collections of objects. Also do some
>>>>>>>> reading on Java enums; your proposed implementation will definitely
>>>>>>>> benefit from using them.
>>>>>>>>
>>>>>>>> Have fun,
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Peter
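
As one possible reading of the enum advice above, the sequence-type constants and the "auto-detection of data type" feature from this thread could be modelled as an enum. The sketch below is hypothetical throughout: SequenceType, accepts and guess are made-up names, the alphabets are simplified, and the guess is only a heuristic (a string such as "ACGT" is also a valid protein).

```java
import java.util.Locale;

/** Hypothetical enum of sequence types, each carrying its allowed alphabet. */
enum SequenceType {
    DNA("ACGTN"),
    RNA("ACGUN"),
    PROTEIN("ACDEFGHIKLMNPQRSTVWYX");

    private final String alphabet;

    SequenceType(String alphabet) {
        this.alphabet = alphabet;
    }

    /** True if every residue belongs to this type's alphabet (case-insensitive). */
    boolean accepts(String residues) {
        for (char c : residues.toUpperCase(Locale.ROOT).toCharArray()) {
            if (alphabet.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Heuristic guess: the first type, in declaration order, that accepts the input. */
    static SequenceType guess(String residues) {
        for (SequenceType type : values()) {
            if (type.accepts(residues)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unrecognised residues: " + residues);
    }
}
```

Compared with int or Class constants, the compiler rejects values that are not one of the declared types, and per-type behaviour lives next to the constant it belongs to.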
>>>>>>>>
>>>>>>>>
>>>>>>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
>>>>>>>>> Hey Andreas,
>>>>>>>>>
>>>>>>>>> It wouldn't be too difficult to make a method that can infer the
>>>>>>>>> file type using the file extension. In fact, it looks like
>>>>>>>>> BioPerl's SeqIO does something like this. On the other hand,
>>>>>>>>> BioPython's SeqIO takes the route of "explicit is better than
>>>>>>>>> implicit," and requires that you explicitly give the format.
>>>>>>>>> Perhaps BioJava could take both routes, and have an overloaded
>>>>>>>>> parse method that infers the file type, along with the regular
>>>>>>>>> explicit method.
>>>>>>>>>
>>>>>>>>> As for non-fasta files, I implemented a couple of FASTQ parsers
>>>>>>>>> here:
>>>>>>>>> http://pastebin.com/KLcpq8Qb
>>>>>>>>> This would work similarly:
>>>>>>>>> This would work similarly:
>>>>>>>>>
>>>>>>>>> InputStream is = ...
>>>>>>>>> Iterable<ProteinSequence> seqs = SeqIO.parse(is, SeqIO.FASTQ_SANGER,
>>>>>>>>>     SeqIO.PROTEIN);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It looks like the other sequence readers aren't as clear-cut, so
>>>>>>>>> they may need a bit more wrapping before they can be adapted to this
>>>>>>>>> method. A common problem is that sequence readers don't return a
>>>>>>>>> specific type of sequence, like with
>>>>>>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
>>>>>>>>> just contains the sequence data in itself. We might want to create
>>>>>>>>> methods that convert the UniprotProxySequenceReader into sequences
>>>>>>>>> that make more sense, like DNASequence and ProteinSequence.
>>>>>>>>>
>>>>>>>>> I'll look into this more later, I have to go to class.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David
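
The "both routes" idea above (an explicit format argument plus an overload that infers it from the file extension) might look something like this. It is a sketch only: FileFormat, FormatAwareParser and the extension lists are assumptions made for illustration, and parse just reports which format was chosen instead of reading anything.

```java
import java.util.Locale;

/** Hypothetical list of formats with the file extensions assumed for each. */
enum FileFormat {
    FASTA("fasta", "fa", "faa", "fna"),
    FASTQ("fastq", "fq"),
    GENBANK("gb", "gbk");

    private final String[] extensions;

    FileFormat(String... extensions) {
        this.extensions = extensions;
    }

    /** Infers the format from the text after the last dot, ignoring case. */
    static FileFormat fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (FileFormat format : values()) {
            for (String candidate : format.extensions) {
                if (candidate.equals(ext)) {
                    return format;
                }
            }
        }
        throw new IllegalArgumentException("Cannot infer format from: " + fileName);
    }
}

/** Hypothetical entry point offering explicit and inferred variants. */
final class FormatAwareParser {
    /** Explicit route: the caller names the format. */
    static String parse(String fileName, FileFormat format) {
        // Placeholder for real parsing: only reports the selected format.
        return fileName + " parsed as " + format;
    }

    /** Implicit route: the format is inferred, then the explicit method is used. */
    static String parse(String fileName) {
        return parse(fileName, FileFormat.fromFileName(fileName));
    }
}
```

Keeping the inferring overload as a thin wrapper means a wrong guess can always be overridden by calling the explicit method, and an unknown extension fails loudly rather than silently picking a parser.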
>>>>>>>>>
>>>>>>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> So far it still feels like a wrapper for what is already there.
>>>>>>>>>> Try to take it to the next level. Why does the user still need to
>>>>>>>>>> provide the type of file? Can't this be auto-detected? What is the
>>>>>>>>>> behaviour for non-fasta files, what can be supported, and where are
>>>>>>>>>> the limits, etc.?
>>>>>>>>>>
>>>>>>>>>> Andreas
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> I've actually been working on something like this for my GSoC
>>>>>>>>>>> proposal; here's what I came up with:
>>>>>>>>>>>
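>>>>>>>>>>> import java.io.InputStream;
>>>>>>>>>>> import org.biojava3.core.sequence.DNASequence;
>>>>>>>>>>> import org.biojava3.core.sequence.ProteinSequence;
>>>>>>>>>>> import org.biojava3.core.sequence.io.FastaReaderHelper;
>>>>>>>>>>> import org.biojava3.core.sequence.template.Sequence;
>>>>>>>>>>>
>>>>>>>>>>> public class SeqIO {
>>>>>>>>>>>    public static final int FASTA = 0;
>>>>>>>>>>>    public static final int FASTQ = 1;
>>>>>>>>>>>    public static final Class<DNASequence> DNA = DNASequence.class;
>>>>>>>>>>>    public static final Class<ProteinSequence> PROTEIN =
>>>>>>>>>>>            ProteinSequence.class;
>>>>>>>>>>>
>>>>>>>>>>>    // Dispatches on file format and sequence type to the existing readers.
>>>>>>>>>>>    @SuppressWarnings("unchecked")
>>>>>>>>>>>    public static <S extends Sequence<?>> Iterable<S> parse(InputStream is,
>>>>>>>>>>>            int fileFormat, Class<S> seqType) throws Exception {
>>>>>>>>>>>        switch (fileFormat) {
>>>>>>>>>>>            case FASTA:
>>>>>>>>>>>                if (seqType == DNA) {
>>>>>>>>>>>                    // The BioJava 3 helper returns a map keyed by ID;
>>>>>>>>>>>                    // iterate over its values.
>>>>>>>>>>>                    return (Iterable<S>)
>>>>>>>>>>>                            FastaReaderHelper.readFastaDNASequence(is).values();
>>>>>>>>>>>                } else if (seqType == PROTEIN) {
>>>>>>>>>>>                    // etc...
>>>>>>>>>>>                }
>>>>>>>>>>>                break;
>>>>>>>>>>>            case FASTQ:
>>>>>>>>>>>                // etc...
>>>>>>>>>>>        }
>>>>>>>>>>>        // Reached only for combinations not handled above.
>>>>>>>>>>>        throw new IllegalArgumentException("Unsupported format or type");
>>>>>>>>>>>    }
>>>>>>>>>>> }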
>>>>>>>>>>>
>>>>>>>>>>> It would be used like so:
>>>>>>>>>>>
>>>>>>>>>>> InputStream is = ...
>>>>>>>>>>> Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA,
>>>>>>>>>>> SeqIO.DNA);
>>>>>>>>>>> for (DNASequence s : seqs) {
>>>>>>>>>>>   // do something
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Obviously it's not the prettiest and a lot could be changed, but
>>>>>>>>>>> that's my initial design. I tried to base it off BioPython's SeqIO,
>>>>>>>>>>> but static typing and the variety of Sequence types forced me to
>>>>>>>>>>> put in some nasty generics.
>>>>>>>>>>> Any tips would be appreciated!
>>>>>>>>>>>
>>>>>>>>>>> David
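
One way the "nasty generics" above might be tamed is to pass a typed token that knows how to build its own sequence type, so the return type follows from the argument without a Class switch or an unchecked cast. The sketch below is an assumption-heavy illustration: SeqKind, TypedSeqIO, Dna and Protein are hypothetical names, and the "format" is just one sequence per line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Stand-in for a DNA sequence class. */
final class Dna {
    final String bases;

    Dna(String bases) {
        this.bases = bases;
    }
}

/** Stand-in for a protein sequence class. */
final class Protein {
    final String residues;

    Protein(String residues) {
        this.residues = residues;
    }
}

/** Typed token: carries the factory for S, so parse needs no cast. */
final class SeqKind<S> {
    static final SeqKind<Dna> DNA = new SeqKind<>(Dna::new);
    static final SeqKind<Protein> PROTEIN = new SeqKind<>(Protein::new);

    private final Function<String, S> factory;

    private SeqKind(Function<String, S> factory) {
        this.factory = factory;
    }

    S create(String residues) {
        return factory.apply(residues);
    }
}

/** Hypothetical entry point whose return type follows the token passed in. */
final class TypedSeqIO {
    /** Builds one sequence per non-blank line of the input text. */
    static <S> List<S> parse(String text, SeqKind<S> kind) {
        List<S> sequences = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                sequences.add(kind.create(trimmed));
            }
        }
        return sequences;
    }
}
```

The call site keeps the shape of the original proposal, e.g. `List<Dna> seqs = TypedSeqIO.parse(text, SeqKind.DNA);`, but adding a new sequence type only requires a new constant rather than another branch in parse.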
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
>>>>>>>>>>> biojava at hannes.oib.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, something like a simplifying and unifying wrapper would be
>>>>>>>>>>>> what I am thinking of.
>>>>>>>>>>>>
>>>>>>>>>>>> Hannes
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi Hannes,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess this is pretty similar to:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
>>>>>>>>>>>>>
>>>>>>>>>>>>> we have also been using "proxy" objects to fetch sequence data
>>>>>>>>>>>>> on the fly
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
>>>>>>>>>>>>>
>>>>>>>>>>>>> As such I think we should discuss this a bit more. If we can
>>>>>>>>>>>>> find a common API that is simple and works with both local
>>>>>>>>>>>>> files and remote proxy objects, that would be nice. There
>>>>>>>>>>>>> should be no need to change much of the existing code, but
>>>>>>>>>>>>> perhaps there could be a simplified wrapper for what is
>>>>>>>>>>>>> already there.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Andreas
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
>>>>>>>>>>>>> <biojava at hannes.oib.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I browsed around in the sister projects Biopython and Bioperl
>>>>>>>>>>>>>> a bit, and noticed that much of the user interaction with the
>>>>>>>>>>>>>> code goes through classes like SeqIO, SearchIO, AlignIO...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So that got me thinking: how about we create similar
>>>>>>>>>>>>>> "Interface" classes in Biojava?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PROS:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - easy change for programmers who switch languages
>>>>>>>>>>>>>>  - easy base interface that can be used in cookbook examples
>>>>>>>>>>>>>>  - makes code more readable if designed properly
>>>>>>>>>>>>>>  - easy access to features that are spread over the whole
>>>>>>>>>>>>>>    codebase but are connected anyway, like all file parsers
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CONS:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - another thing to maintain
>>>>>>>>>>>>>>  - creates possible cross-dependencies (but if you don't want
>>>>>>>>>>>>>>    that, just use the existing classes directly)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are your thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> python from http://biopython.org/wiki/SeqIO:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> from Bio import SeqIO
>>>>>>>>>>>>>> handle = open("example.fasta", "rU")
>>>>>>>>>>>>>> for record in SeqIO.parse(handle, "fasta") :
>>>>>>>>>>>>>>    print record.id
>>>>>>>>>>>>>> handle.close()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> possible equivalent in biojava (support for streaming API,
>>>>>>>>>>>>>> Iterators, etc?):
>>>>>>>>>>>>>>
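>>>>>>>>>>>>>> import java.io.File;
>>>>>>>>>>>>>> import org.biojava3.util.SeqIO;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> File file = new File("example.fasta");
>>>>>>>>>>>>>> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
>>>>>>>>>>>>>> while (seqIO.hasNext()) {
>>>>>>>>>>>>>>    System.out.println(seqIO.next());
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> // java.io.File has no close(); the proposed reader would own
>>>>>>>>>>>>>> // the underlying stream and close it.
>>>>>>>>>>>>>> seqIO.close();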
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hannes
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> biojava-dev mailing list
>>>>>>>>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----------------------------------------------------------------------
>>>>>>>>>>>>> Dr. Andreas Prlic
>>>>>>>>>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>>>>>>>>>> University of California, San Diego
>>>>>>>>>>>>> (+1) 858.246.0526
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>