[Biojava-dev] Biojava.util package?

Sun Apr 1 17:03:29 UTC 2012

>> Also, what other parsers are you going to write?
> I've been looking into the GenBank, Stockholm, CATH, and UniProt XML
> formats, which are mentioned here:
> http://biojava.org/wiki/BioJava3_Feature_Request

These are good suggestions. Could you also have a look at more
multiple sequence alignment formats, e.g. PIR, PFAM, Stockholm, MSF,
Clustal? A sequence feature parser such as GFF
(http://www.sanger.ac.uk/resources/software/gff/spec.html) might be
useful too, as might phylogeny parsers, e.g. a Newick tree file parser.
As for the GenBank parser, I think we should focus on the XML version
of the GenBank file, as this is now widely available, and use standard
Java XML readers for the implementation.
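
To make the "standard Java XML readers" idea concrete, here is a minimal sketch using the JDK's StAX API (javax.xml.stream). It assumes the element names GBSeq, GBSeq_locus and GBSeq_sequence from NCBI's GBSeq XML layout; the class GbXmlSketch and its Entry holder are hypothetical names for illustration and are not part of BioJava.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class GbXmlSketch {
    /** Hypothetical holder for the two fields extracted here. */
    static final class Entry {
        final String locus;
        final String sequence;
        Entry(String locus, String sequence) {
            this.locus = locus;
            this.sequence = sequence;
        }
    }

    /** Streams through the document and collects one Entry per GBSeq element. */
    static List<Entry> parse(Reader in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<Entry> entries = new ArrayList<>();
        String locus = null;
        String sequence = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("GBSeq_locus")) {
                    locus = xml.getElementText();      // assumed element name
                } else if (name.equals("GBSeq_sequence")) {
                    sequence = xml.getElementText();   // assumed element name
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("GBSeq")) {
                // End of one record: store it and reset for the next one.
                entries.add(new Entry(locus, sequence));
                locus = null;
                sequence = null;
            }
        }
        xml.close();
        return entries;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<GBSet><GBSeq><GBSeq_locus>TEST1</GBSeq_locus>"
                + "<GBSeq_sequence>acgt</GBSeq_sequence></GBSeq></GBSet>";
        for (Entry e : parse(new StringReader(doc))) {
            System.out.println(e.locus + " " + e.sequence);
        }
    }
}
```

Because StAX is a pull parser, only the current record is held in memory, which fits the lazy IO goal discussed further down the thread.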

Regards,
Peter

>> Now you need to look at the parsers in BioJava and have an idea of
>> how you are going to unify them.
> This is what I was trying to figure out earlier. After looking at
> BioPython, I think it might be effective to read files into a common
> sequence class (BioPython uses SeqRecord), and then provide utilities
> to convert from this sequence to others like DNASequence, RNASequence,
> and ProteinSequence. This could avoid some of the complexities of the
> Abstract Factory and Builder patterns that are sometimes used in
> situations like this. Additionally, it shouldn't be too hard to unify
> the current parsers under this system. FastaReader and FastaWriter
> already have interfaces that make it easy to extend their functionality,
> so they won't be a problem. FastqReader already does something like
> what I'm proposing, so it shouldn't be too difficult to adapt either.
> The others seem to be somewhat different, so I'll have to examine them
> more closely.
>
> David
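
The "common record, then convert" idea above could look roughly like the sketch below. Everything in it is a stand-in: SeqRecord, Alphabet and TypedSequence are hypothetical names, the alphabets are deliberately simplified, and TypedSequence only plays the role that BioJava's DNASequence, RNASequence and ProteinSequence would play in a real implementation.

```java
import java.util.Locale;

class SeqRecordSketch {
    /** Simplified residue alphabets (assumption: no ambiguity codes beyond N). */
    enum Alphabet {
        DNA("ACGTN"), RNA("ACGUN"), PROTEIN("ACDEFGHIKLMNPQRSTVWY");

        final String symbols;
        Alphabet(String symbols) { this.symbols = symbols; }
    }

    /** Stand-in for a typed sequence class such as DNASequence. */
    static final class TypedSequence {
        final String id;
        final String residues;
        final Alphabet alphabet;
        TypedSequence(String id, String residues, Alphabet alphabet) {
            this.id = id;
            this.residues = residues;
            this.alphabet = alphabet;
        }
    }

    /** Format-neutral record that every parser would produce. */
    static final class SeqRecord {
        final String id;
        final String residues;
        SeqRecord(String id, String residues) {
            this.id = id;
            this.residues = residues;
        }

        /** Converts to a typed sequence, rejecting residues outside the alphabet. */
        TypedSequence to(Alphabet alphabet) {
            String upper = residues.toUpperCase(Locale.ROOT);
            for (int i = 0; i < upper.length(); i++) {
                if (alphabet.symbols.indexOf(upper.charAt(i)) < 0) {
                    throw new IllegalArgumentException("Residue '" + upper.charAt(i)
                            + "' at position " + i + " is not valid for " + alphabet);
                }
            }
            return new TypedSequence(id, upper, alphabet);
        }
    }

    public static void main(String[] args) {
        SeqRecord record = new SeqRecord("seq1", "acgtn");
        TypedSequence dna = record.to(Alphabet.DNA);
        System.out.println(dna.id + " " + dna.alphabet + " " + dna.residues);
    }
}
```

The parsers then only need to know about SeqRecord, and the choice of sequence type moves to a single conversion step.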
>
> On Sat, Mar 31, 2012 at 7:38 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>> Does this look like a fair list?
>>
>> Yes your important feature list makes a lot of sense, though I do not
>> think any of the other features do (yes, your code needs to have
>> sensible defaults, but also custom function for more specific cases).
>> Now you need to look at the parsers in BioJava and have an idea of how
>> you are going to unify them.
>> Also, what other parsers are you going to write?
>> We are slipping into implementation here, but Java is an OO language;
>> although you can store a FASTA sequence in a Map, it is not going to
>> be as flexible as a custom object.
>> It may be a semantic difference, but it is an important one: it is the
>> difference between a good API and a bad API, between code that is easy
>> to use and code that is not. David, how much experience do you have with Java?
>>
>> Regards,
>> Peter
>>
>>
>> On 31 March 2012 18:16, David Felty <davfelty at gmail.com> wrote:
>>> I've been looking at the file parsers for BioPython and BioPerl, and
>>> here are some features I've compiled:
>>> Important features:
>>> - Conversion between file formats
>>> - Lazy IO; useful for large files
>>> - Use Iterable interface so we get Java foreach over sequences
>>> - Index sequences by ID (turn a list of sequences into a map from ID -> seq)
>>> - Fetching from remote databases
>>>
>>> Other features:
>>> - Restrict fields needed to speed up parsing; see
>>> http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
>>> - Auto-detect file format (use file extension)
>>> - General-purpose API with sensible defaults for most cases, and a
>>> more specific but complex API for more control
>>> - Index sequences by a user-defined value
>>> - Store indexed database files locally (BioPython stores as a SQLite database)
>>>
>>> Does this look like a fair list? I tried to look for common use cases
>>> in BioJava's tutorial, but I only found this page, which comes from
>>> BioJava 1.8: http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
>>> Are there any other useful sources I could look at? Or perhaps even
>>> some real-world code that makes use of parsers?
>>>
>>> Thanks,
>>> David
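
Three of the items above (lazy IO, the Iterable interface, and indexing by ID) fit together naturally. The sketch below assumes plain FASTA input and represents each record as a two-element String array of ID and sequence, purely to stay self-contained; LazyFastaSketch and indexById are hypothetical names and not existing BioJava API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

class LazyFastaSketch implements Iterable<String[]> {
    private final BufferedReader in;

    LazyFastaSketch(BufferedReader in) { this.in = in; }

    /** Single-pass iterator: each next() reads only one record from the stream. */
    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            // Header line of the record that next() will return, or null at end of input.
            private String header = skipToHeader();

            private String readLine() {
                try {
                    return in.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            private String skipToHeader() {
                String line = readLine();
                while (line != null && !line.startsWith(">")) {
                    line = readLine();
                }
                return line;
            }

            @Override
            public boolean hasNext() { return header != null; }

            @Override
            public String[] next() {
                if (header == null) throw new NoSuchElementException();
                // The ID is the first whitespace-delimited token after '>'.
                String id = header.substring(1).trim().split("\\s+")[0];
                StringBuilder seq = new StringBuilder();
                String line = readLine();
                while (line != null && !line.startsWith(">")) {
                    seq.append(line.trim());
                    line = readLine();
                }
                header = line; // remember the next header, if any
                return new String[] { id, seq.toString() };
            }
        };
    }

    /** Turns any stream of records into a map from ID to sequence, keeping file order. */
    static Map<String, String> indexById(Iterable<String[]> records) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String[] r : records) {
            index.put(r[0], r[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        String fasta = ">seq1 first\nACGT\nAC\n>seq2\nGGG\n";
        BufferedReader reader = new BufferedReader(new StringReader(fasta));
        for (String[] r : new LazyFastaSketch(reader)) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

Note that indexing consumes the whole stream, so it trades the memory advantage of lazy reading for random access by ID.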
>>>
>>> On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>> But are there any additional features anyone wants me to consider? Once again, I don't
>>>>> have the same experience as many of you, so your input is very
>>>>> helpful!
>>>>
>>>> David, I think this is a pretty good list. Remember that you are getting
>>>> into something more than just a FASTA parser.
>>>>
>>>>> But here is what I've gathered so far from a combination
>>>>> of already-existing code and people's responses:
>>>>
>>>> I think this is a very good approach.
>>>> Look at the existing parsers in BioJava and beyond, the features that
>>>> are common will be the most important. Less common will be useful in
>>>> some cases but less in others. Come up with a set of use cases and try
>>>> using the parsers to achieve them, see how easy (or indeed possible)
>>>> it is going to be with various parsers. I appreciate this is a lot of
>>>> work, but this way you'll know by heart what a good parser consists
>>>> of.
>>>> You can learn from many implementations to get your own just right.
>>>> Once you've done this, you are going to be the expert and will be able
>>>> to come up with a list of features in order of importance that your
>>>> parser is going to have and have some guesstimate of how long it is
>>>> going to take you to implement them. Do not hesitate to ask the
>>>> community if there is something you cannot get your head around.
>>>>
>>>> Good luck,
>>>> Peter
>>>>
>>>>
>>>> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
>>>>>>I'd suggest you step aside from the
>>>>>>details of implementation. Think about what features your parser(s)
>>>>>>must have and then how you are going to achieve them
>>>>>
>>>>> Thank you for this! I now realize that I've been concentrating too
>>>>> much on the implementation rather than the features. The
>>>>> implementation will be important when (or if) I actually work on the
>>>>> project during GSoC, but for now, I'll try to focus on features for my
>>>>> proposal.
>>>>>
>>>>> Unfortunately, I'm not very acquainted with the world of computational
>>>>> biology, so I can't be sure what features would be most useful for the
>>>>> file parsers. But here is what I've gathered so far from a combination
>>>>> of already-existing code and people's responses:
>>>>> - Simple API
>>>>> - Robust
>>>>> - Extensible
>>>>> - Good performance
>>>>> - Feature-rich
>>>>> - Wide variety of parsers
>>>>> - Proxy-fetching from remote databases (by ID or location)
>>>>> - Local caching
>>>>> - Auto-detection of data type
>>>>> - Auto-detection of file format
>>>>> - Lazy IO
>>>>> - Random access file reading
>>>>>
>>>>> Obviously, these are not all of equal importance, so I'll have to pick
>>>>> out the most important ones for my proposal. But are there any
>>>>> additional features anyone wants me to consider? Once again, I don't
>>>>> have the same experience as many of you, so your input is very
>>>>> helpful!
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> Great to see such a discussion! You should see how important your work
>>>>>> is going to be for the Bio community.
>>>>>>
>>>>>> Now, what you need to do is to try taking into account what other
>>>>>> people were suggesting and put it into your proposal. It's not going
>>>>>> to be any good just to add a bunch of opinions; you need to come up
>>>>>> with a coherent proposal. For this I'd suggest you step aside from the
>>>>>> details of implementation. Think about what features your parser(s)
>>>>>> must have and then how you are going to achieve them.
>>>>>> I'd suggest that your parsers should be
>>>>>> - easy to use (IMHO this is something BioJava 1 FASTA parser lacked)
>>>>>> - robust
>>>>>> - extensible
>>>>>> - have good performance
>>>>>> - most importantly, have sufficiently rich feature set so that we can
>>>>>> replace other parsers (for the same format) in BioJava with yours.
>>>>>>
>>>>>> Do not forget to split your work in several achievable stages.
>>>>>>
>>>>>> I'd be careful about transferring the design from Python and
>>>>>> especially a decade-old Perl implementation straight to Java. While
>>>>>> high-level concepts may be similar, implementation details should
>>>>>> not be. It’s not that there is anything wrong with these parsers; it
>>>>>> is just that the languages are different. It is good to know how things
>>>>>> are done elsewhere, but I'd suggest that for a Java implementation you
>>>>>> should be taking inspiration from some well-known Java features. For
>>>>>> example, the Java Collections - a set of highly regarded tools for
>>>>>> working with various collections of objects. Also do some reading on
>>>>>> Java enums; your proposed implementation will definitely benefit from
>>>>>> using them.
>>>>>>
>>>>>> Have fun,
>>>>>>
>>>>>> Regards,
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
>>>>>>> Hey Andreas,
>>>>>>>
>>>>>>> It wouldn't be too difficult to make a method that can infer the
>>>>>>> file type using the file extension. In fact, it looks like BioPerl's
>>>>>>> SeqIO does something like this. On the other hand, BioPython's SeqIO
>>>>>>> takes the route of "explicit is better than implicit," and requires
>>>>>>> that you explicitly give the format. Perhaps BioJava could take both
>>>>>>> routes, and have an overloaded parse method that infers the file type,
>>>>>>> along with the regular explicit method.
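
The "both routes" idea, an explicit format argument plus an overload that infers it from the file extension, might be sketched as follows. An enum is used for the formats rather than int constants. The extension lists are assumptions based on common naming conventions, and FormatDetectSketch, SeqFormat and describe are hypothetical names; describe stands in for a real parse method so that the example stays runnable without BioJava.

```java
import java.util.Locale;

class FormatDetectSketch {
    /** Supported formats, each with the file extensions assumed to indicate it. */
    enum SeqFormat {
        FASTA("fasta", "fa", "fna", "faa"),
        FASTQ("fastq", "fq"),
        GENBANK("gb", "gbk");

        private final String[] extensions;

        SeqFormat(String... extensions) { this.extensions = extensions; }

        /** Infers the format from the extension, ignoring case. */
        static SeqFormat fromFileName(String fileName) {
            int dot = fileName.lastIndexOf('.');
            if (dot < 0 || dot == fileName.length() - 1) {
                throw new IllegalArgumentException("No file extension: " + fileName);
            }
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            for (SeqFormat format : values()) {
                for (String candidate : format.extensions) {
                    if (candidate.equals(ext)) {
                        return format;
                    }
                }
            }
            throw new IllegalArgumentException("Unrecognised extension: " + ext);
        }
    }

    /** Explicit route: the caller states the format. */
    static String describe(String fileName, SeqFormat format) {
        return fileName + " -> " + format;
    }

    /** Implicit route: the format is inferred, then the explicit method is reused. */
    static String describe(String fileName) {
        return describe(fileName, SeqFormat.fromFileName(fileName));
    }

    public static void main(String[] args) {
        System.out.println(describe("reads.fq"));
        System.out.println(describe("odd_name.txt", SeqFormat.FASTA));
    }
}
```

Failing loudly on an unknown extension keeps the implicit route honest: callers with unusual file names fall back to the explicit overload instead of getting a silently wrong parser.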
>>>>>>>
>>>>>>> As for non-FASTA files, I implemented a couple of FASTQ parsers here:
>>>>>>> http://pastebin.com/KLcpq8Qb
>>>>>>> This would work similarly:
>>>>>>>
>>>>>>> InputStream is = ...
>>>>>>> Iterable<ProteinSequence> seqs = SeqIO.parse(is, SeqIO.FASTQ_SANGER, SeqIO.PROTEIN);
>>>>>>>
>>>>>>>
>>>>>>> It looks like the other sequence readers aren't as clear-cut, so they
>>>>>>> may need a bit more wrapping before they can be adapted to this
>>>>>>> method. A common problem is that sequence readers don't return a
>>>>>>> specific type of sequence, like with
>>>>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
>>>>>>> just contains the sequence data in itself. We might want to create
>>>>>>> methods that convert the UniprotProxySequenceReader into sequences
>>>>>>> that make more sense, like DNASequence and ProteinSequence.
>>>>>>>
>>>>>>> I'll look into this more later, I have to go to class.
>>>>>>>
>>>>>>> Regards,
>>>>>>> David
>>>>>>>
>>>>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> so far it still feels like a wrapper for what is already there. Try to
>>>>>>>> take it to the next level. Why does the user still need to provide the
>>>>>>>> type of file, can't this be auto-detected? What is the behaviour for
>>>>>>>> non-fasta files, what can be supported and where are the limits, etc.
>>>>>>>>
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com> wrote:
>>>>>>>> > I've actually been working on something like this for my GSoC proposal,
>>>>>>>> > here's what I came up with:
>>>>>>>> >
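>>>>>>>> > import java.io.InputStream;
>>>>>>>> > import org.biojava3.core.sequence.DNASequence;
>>>>>>>> > import org.biojava3.core.sequence.ProteinSequence;
>>>>>>>> > import org.biojava3.core.sequence.io.FastaReaderHelper;
>>>>>>>> > import org.biojava3.core.sequence.template.Sequence;
>>>>>>>> >
>>>>>>>> > public class SeqIO {
>>>>>>>> >    public static final int FASTA = 0;
>>>>>>>> >    public static final int FASTQ = 1;
>>>>>>>> >    public static final Class<DNASequence> DNA = DNASequence.class;
>>>>>>>> >    public static final Class<ProteinSequence> PROTEIN = ProteinSequence.class;
>>>>>>>> >
>>>>>>>> >    @SuppressWarnings("unchecked")
>>>>>>>> >    public static <S extends Sequence> Iterable<S> parse(InputStream is,
>>>>>>>> >            int fileFormat, Class<S> seqType) throws Exception {
>>>>>>>> >        switch (fileFormat) {
>>>>>>>> >            case FASTA:
>>>>>>>> >                if (seqType == DNA) {
>>>>>>>> >                    // The BioJava 3 helper returns a map keyed by ID, so iterate over its values.
>>>>>>>> >                    return (Iterable<S>) FastaReaderHelper.readFastaDNASequence(is).values();
>>>>>>>> >                } else if (seqType == PROTEIN) {
>>>>>>>> >                    // etc...
>>>>>>>> >                }
>>>>>>>> >                break;
>>>>>>>> >            case FASTQ:
>>>>>>>> >                // etc...
>>>>>>>> >                break;
>>>>>>>> >        }
>>>>>>>> >        // Every path must return or throw, otherwise the method does not compile.
>>>>>>>> >        throw new IllegalArgumentException("Unsupported format or sequence type");
>>>>>>>> >    }
>>>>>>>> > }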
>>>>>>>> >
>>>>>>>> > It would be used like so:
>>>>>>>> >
>>>>>>>> > InputStream is = ...
>>>>>>>> > Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA, SeqIO.DNA);
>>>>>>>> > for (DNASequence s : seqs) {
>>>>>>>> >   // do something
>>>>>>>> > }
>>>>>>>> >
>>>>>>>> > Obviously it's not the prettiest and a lot could be changed, but that's my
>>>>>>>> > initial design. I tried to base it off BioPython's SeqIO, but static typing
>>>>>>>> > and the variety of Sequence types forced me to put in some nasty generics.
>>>>>>>> > Any tips would be appreciated!
>>>>>>>> >
>>>>>>>> > David
>>>>>>>> >
>>>>>>>> > On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
>>>>>>>> > biojava at hannes.oib.com> wrote:
>>>>>>>> >
>>>>>>>> >> Yes, something like a simplifying and unifying wrapper would be what I
>>>>>>>> >> am thinking of.
>>>>>>>> >>
>>>>>>>> >> Hannes
>>>>>>>> >>
>>>>>>>> >> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>> >> > Hi Hannes,
>>>>>>>> >> >
>>>>>>>> >> > I guess this is pretty similar to:
>>>>>>>> >> >
>>>>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
>>>>>>>> >> >
>>>>>>>> >> > we have also been using "proxy" objects to fetch sequence data on the fly
>>>>>>>> >> >
>>>>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
>>>>>>>> >> >
>>>>>>>> >> > As such I think we should discuss this a bit more. If we can find a
>>>>>>>> >> > common api that is simple and works with both local files as well as
>>>>>>>> >> > remote proxy objects, that would be nice. There should be no need to
>>>>>>>> >> > change much of the existing code, but perhaps there could be a
>>>>>>>> >> > simplified wrapper for what is already there.
>>>>>>>> >> >
>>>>>>>> >> >  Andreas
>>>>>>>> >> >
>>>>>>>> >> > On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
>>>>>>>> >> > <biojava at hannes.oib.com> wrote:
>>>>>>>> >> >> Hi,
>>>>>>>> >> >>
>>>>>>>> >> >> I browsed around in the sister projects Biopython and Bioperl a bit,
>>>>>>>> >> >> and noticed that much of the user interaction with the code goes
>>>>>>>> >> >> through classes like SeqIO, SearchIO, AlignIO...
>>>>>>>> >> >>
>>>>>>>> >> >> So that got me thinking: how about we create similar "Interface"
>>>>>>>> >> >> classes in Biojava?
>>>>>>>> >> >>
>>>>>>>> >> >> PROS:
>>>>>>>> >> >>
>>>>>>>> >> >>  - easy change for programmers who switch languages
>>>>>>>> >> >>  - easy base interface that can be used in cookbook examples
>>>>>>>> >> >>  - makes code more readable if designed properly
>>>>>>>> >> >>  - easy access to features that are spread over the whole codebase but
>>>>>>>> >> >> are connected anyway, like all file parsers
>>>>>>>> >> >>
>>>>>>>> >> >> CONS:
>>>>>>>> >> >>
>>>>>>>> >> >>  - another thing to maintain
>>>>>>>> >> >>  - creates possible cross-dependencies (but if you don't want that,
>>>>>>>> >> >> just use the existing classes directly)
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> What are your thoughts?
>>>>>>>> >> >>
>>>>>>>> >> >> python from http://biopython.org/wiki/SeqIO:
>>>>>>>> >> >>
>>>>>>>> >> >> from Bio import SeqIO
>>>>>>>> >> >> handle = open("example.fasta", "rU")
>>>>>>>> >> >> for record in SeqIO.parse(handle, "fasta") :
>>>>>>>> >> >>    print record.id
>>>>>>>> >> >> handle.close()
>>>>>>>> >> >>
>>>>>>>> >> >> possible equivalent in biojava (support for streaming API, Iterators, etc?):
>>>>>>>> >> >>
>>>>>>>> >> >> import java.io.File;
>>>>>>>> >> >> import org.biojava3.util.SeqIO;
>>>>>>>> >> >>
>>>>>>>> >> >> File file = new File("example.fasta");
>>>>>>>> >> >> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
>>>>>>>> >> >> while (seqIO.hasNext()) {
>>>>>>>> >> >>    System.out.println(seqIO.next());
>>>>>>>> >> >> }
>>>>>>>> >> >> seqIO.close(); // java.io.File has no close(); the proposed SeqIO would own the stream
>>>>>>>> >> >>
>>>>>>>> >> >> Hannes
>>>>>>>> >> >> _______________________________________________
>>>>>>>> >> >> biojava-dev mailing list
>>>>>>>> >> >> biojava-dev at lists.open-bio.org
>>>>>>>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> > --
>>>>>>>> >> > -----------------------------------------------------------------------
>>>>>>>> >> > Dr. Andreas Prlic
>>>>>>>> >> > Senior Scientist, RCSB PDB Protein Data Bank
>>>>>>>> >> > University of California, San Diego
>>>>>>>> >> > (+1) 858.246.0526
>>>>>>>> >> > -----------------------------------------------------------------------