[Biojava-dev] Biojava.util package?

David Felty davfelty at gmail.com
Sun Apr 1 01:02:58 UTC 2012


> David, how much experience do you have with Java?
I have 3 years of Java experience. You can look at the code exercise I
sent in recently to get some idea of my skill.

> Also what other parsers you are going to write?
I've been looking into the GenBank, Stockholm, CATH, and UniProt XML
formats, which are mentioned here:
http://biojava.org/wiki/BioJava3_Feature_Requests

> Now you need to look at the parsers in BioJava and have an idea of
> how you are going to unify them.
This is what I was trying to figure out earlier. After looking at
BioPython, I think it might be effective to read files into a common
sequence class (BioPython uses SeqRecord), and then provide utilities
to convert from this sequence to others like DNASequence, RNASequence,
and ProteinSequence. This could avoid some of the complexities of the
Abstract Factory and Builder patterns that are sometimes used in
situations like this. Additionally, it shouldn't be too hard to unify
the current parsers under this system. FastaReader and FastaWriter
already have interfaces that make it easy to extend their functionality,
so they won't be a problem. FastqReader already does something like
what I'm proposing, so it shouldn't be too difficult to adapt either.
The others seem to be somewhat different, so I'll have to examine them
more closely.
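
A minimal sketch of the common-record idea, under stated assumptions: SeqRecord, Alphabet, and TypedSequence are hypothetical names invented for illustration and are not BioJava classes. TypedSequence stands in for DNASequence, RNASequence, and ProteinSequence. Parsers would return only the format-neutral record, and a separate conversion step checks the residues against the requested alphabet.

```java
import java.util.Locale;

/** Hypothetical alphabets; each knows which residue letters it accepts. */
enum Alphabet {
    DNA("ACGTN"),
    RNA("ACGUN"),
    PROTEIN("ACDEFGHIKLMNPQRSTVWYX");

    private final String letters;

    Alphabet(String letters) {
        this.letters = letters;
    }

    /** True if every residue is a letter of this alphabet. */
    boolean accepts(String residues) {
        for (int i = 0; i < residues.length(); i++) {
            if (letters.indexOf(residues.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical format-neutral record that every parser would return. */
final class SeqRecord {
    final String id;
    final String residues;

    SeqRecord(String id, String residues) {
        this.id = id;
        this.residues = residues.toUpperCase(Locale.ROOT);
    }
}

/** Hypothetical typed view: a record plus an alphabet it was checked against. */
final class TypedSequence {
    final SeqRecord record;
    final Alphabet alphabet;

    private TypedSequence(SeqRecord record, Alphabet alphabet) {
        this.record = record;
        this.alphabet = alphabet;
    }

    /** Conversion utility: fails fast if the residues do not fit the alphabet. */
    static TypedSequence of(SeqRecord record, Alphabet alphabet) {
        if (!alphabet.accepts(record.residues)) {
            throw new IllegalArgumentException(
                    record.id + " is not a valid " + alphabet + " sequence");
        }
        return new TypedSequence(record, alphabet);
    }
}
```

Note that a record such as "ACGT" passes both the DNA and the PROTEIN check in this sketch, which is one reason to let the caller name the target type explicitly instead of guessing it from the residues.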

David

On Sat, Mar 31, 2012 at 7:38 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>> Does this look like a fair list?
>
> Yes your important feature list makes a lot of sense, though I do not
> think any of the other features do (yes, your code needs to have
> sensible defaults, but also custom function for more specific cases).
> Now you need to look at the parsers in BioJava and have an idea of how
> you are going to unify them.
> Also what other parsers you are going to write?
> We are slipping into implementation here, but Java is an OO language:
> although you can store a FASTA sequence in a Map, it is not going to
> be as flexible as a custom object.
> It may be a semantic difference but it is an important one; it is the
> difference between a good API and a bad API, between easy-to-use and
> not-so-easy-to-use code. David, how much experience do you have with Java?
>
> Regards,
> Peter
>
>
> On 31 March 2012 18:16, David Felty <davfelty at gmail.com> wrote:
>> I've been looking at the file parsers for BioPython and BioPerl, and
>> here are some features I've compiled:
>> Important features:
>> - Conversion between file formats
>> - Lazy IO; useful for large files
>> - Use Iterable interface so we get Java foreach over sequences
>> - Index sequences by ID (turn a list of sequences to a map from ID -> seq)
>> - Fetching from remote databases
>>
>> Other features:
>> - Restrict fields needed to speed up parsing; see
>> http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
>> - Auto-detect file format (use file extension)
>> - General-purpose API with sensible defaults for most cases, and a
>> more specific but complex API for more control
>> - Index sequences by a user-defined value
>> - Store indexed database files locally (BioPython stores as a SQLite database)
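
Two of the items listed above, lazy IO behind the Iterable interface and indexing by ID, can be sketched together. This is an illustration under stated assumptions, not BioJava code: LazyFasta and indexById are hypothetical names, records are plain {id, sequence} string pairs, and the ID is taken to be the first whitespace-delimited token of the header line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical lazy FASTA reader: yields {id, sequence} pairs one at a time. */
final class LazyFasta implements Iterable<String[]> {
    private final BufferedReader in;

    LazyFasta(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Single-pass: the iterator consumes the underlying reader as it goes. */
    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            // Header line of the record that has not been returned yet.
            private String header = nextHeader();

            private String readLine() {
                try {
                    return in.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            // Skip anything before the first '>' line.
            private String nextHeader() {
                String line;
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        return line;
                    }
                }
                return null;
            }

            @Override
            public boolean hasNext() {
                return header != null;
            }

            @Override
            public String[] next() {
                if (header == null) {
                    throw new NoSuchElementException();
                }
                String id = header.substring(1).trim().split("\\s+")[0];
                StringBuilder seq = new StringBuilder();
                header = null;
                String line;
                // Read only up to the next header, so one record is in memory at a time.
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        header = line;
                        break;
                    }
                    seq.append(line.trim());
                }
                return new String[] {id, seq.toString()};
            }
        };
    }

    /** Index by ID: turns the stream of records into a map from ID to sequence. */
    static Map<String, String> indexById(Iterable<String[]> records) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String[] record : records) {
            index.put(record[0], record[1]);
        }
        return index;
    }
}
```

Because LazyFasta implements Iterable, it works directly in a Java foreach loop. Building the index does load every record, so it trades the memory benefit of lazy reading for lookup by ID.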
>>
>> Does this look like a fair list? I tried to look for common use cases
>> in BioJava's tutorial, but I only found this page, which comes from
>> BioJava 1.8: http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
>> Are there any other useful sources I could look at? Or perhaps even
>> some real-world code that makes use of parsers?
>>
>> Thanks,
>> David
>>
>> On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>> But are there any additional features anyone wants me to consider? Once again, I don't
>>>> have the same experience as many of you, so your input is very
>>>> helpful!
>>>
>>> David, I think this is a pretty good list. Remember you are here into
>>> something more than just a FASTA parser.
>>>
>>>> But here is what I've gathered so far from a combination
>>>> of already-existing code and people's responses:
>>>
>>> I think this is a very good approach.
>>> Look at the existing parsers in BioJava and beyond, the features that
>>> are common will be the most important. Less common will be useful in
>>> some cases but less in others. Come up with a set of use cases and try
>>> using the parsers to achieve them, see how easy (or indeed possible)
>>> it is going to be with various parsers. I appreciate this is a lot of
>>> work, but this way you'll know by heart what a good parser consists
>>> of.
>>> You can learn from many implementations to get your own just right.
>>> Once you've done this, you are going to be the expert and will be able
>>> to come up with a list of features in order of importance that your
>>> parser is going to have and have some guesstimate of how long it is
>>> going to take you to implement them. Do not hesitate to ask the
>>> community if there is something you cannot get your head around.
>>>
>>> Good luck,
>>> Peter
>>>
>>>
>>> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
>>>>>I'd suggest you step aside from the
>>>>>details of implementation. Think about what features your parser(s)
>>>>>must have and then how you are going to achieve them
>>>>
>>>> Thank you for this! I now realize that I've been concentrating too
>>>> much on the implementation rather than the features. The
>>>> implementation will be important when (or if) I actually work on the
>>>> project during GSoC, but for now, I'll try to focus on features for my
>>>> proposal.
>>>>
>>>> Unfortunately, I'm not very acquainted with the world of computational
>>>> biology, so I can't be sure what features would be most useful for the
>>>> file parsers. But here is what I've gathered so far from a combination
>>>> of already-existing code and people's responses:
>>>> - Simple api
>>>> - Robust
>>>> - Extensible
>>>> - Good performance
>>>> - Feature-rich
>>>> - Wide variety of parsers
>>>> - Proxy-fetching from remote databases (by ID or location)
>>>> - Local caching
>>>> - Auto-detection of data type
>>>> - Auto-detection of file format
>>>> - Lazy IO
>>>> - Random access file reading
>>>>
>>>> Obviously, these are not all of equal importance, so I'll have to pick
>>>> out the most important ones for my proposal. But are there any
>>>> additional features anyone wants me to consider? Once again, I don't
>>>> have the same experience as many of you, so your input is very
>>>> helpful!
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>> Hi David,
>>>>>
>>>>> Great to see such a discussion! You should see how important your work
>>>>> for Bio community is going to be.
>>>>>
>>>>> Now, what you need to do is to try taking into account what other
>>>>> people were suggesting and put it into your proposal. It's not going
>>>>> to be any good just to add a bunch of opinions; you need to come up
>>>>> with a coherent proposal. For this I'd suggest you step aside from the
>>>>> details of implementation. Think about what features your parser(s)
>>>>> must have and then how you are going to achieve them.
>>>>> I'd suggest that your parsers should be
>>>>> - easy to use (IMHO this is something BioJava 1 FASTA parser lacked)
>>>>> - robust
>>>>> - extensible
>>>>> - have good performance
>>>>> - most importantly, have sufficiently rich feature set so that we can
>>>>> replace other parsers (for the same format) in BioJava with yours.
>>>>>
>>>>> Do not forget to split your work in several achievable stages.
>>>>>
>>>>> I'd be careful about transferring the design from Python and
>>>>> especially a decade old Perl implementation straight to Java. While
>>>>> high level concepts may be similar, implementation details should
>>>>> not be. It’s not that there is anything wrong with these parsers, it
>>>>> is just that the languages are different. It is good to know how things
>>>>> are done elsewhere, but I'd suggest that for Java implementation you
>>>>> should be taking inspiration from some well-known Java feature. For
>>>>> example, the Java Collections - a set of highly regarded tools for
>>>>> working with various collections of objects. Also do some reading on
>>>>> Java enums, your proposed implementation will definitely benefit from
>>>>> using them.
>>>>>
>>>>> Have fun,
>>>>>
>>>>> Regards,
>>>>> Peter
>>>>>
>>>>>
>>>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
>>>>>> Hey Andreas,
>>>>>>
>>>>>> It wouldn't be too difficult to make a method that can infer the
>>>>>> file type using the file extension. In fact, it looks like BioPerl's
>>>>>> SeqIO does something like this. On the other hand, BioPython's SeqIO
>>>>>> takes the route of "explicit is better than implicit," and requires
>>>>>> that you explicitly give the format. Perhaps BioJava could take both
>>>>>> routes, and have an overloaded parse method that infers the file type,
>>>>>> along with the regular explicit method.
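
The two routes described above, an explicit format argument plus an inferring overload, could be backed by an enum. The sketch below is an assumption-laden illustration: SeqFormat and fromFileName are hypothetical names, and the extension lists are examples, not a complete or authoritative mapping.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical format enum; the extension lists are illustrative only. */
enum SeqFormat {
    FASTA("fasta", "fa", "fna", "faa"),
    FASTQ("fastq", "fq"),
    GENBANK("gb", "gbk");

    private final String[] extensions;

    SeqFormat(String... extensions) {
        this.extensions = extensions;
    }

    /** Implicit route: guess the format from the file name, empty if unknown. */
    static Optional<SeqFormat> fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (SeqFormat format : values()) {
            for (String known : format.extensions) {
                if (known.equals(ext)) {
                    return Optional.of(format);
                }
            }
        }
        return Optional.empty();
    }
}
```

An inferring parse overload would call fromFileName and delegate to the explicit overload, failing with a clear message when the result is empty. Input streams carry no file name, so they would still need the explicit format.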
>>>>>>
>>>>>> As for non-fasta files, I implemented a couple of FASTQ parsers here:
>>>>>> http://pastebin.com/KLcpq8Qb
>>>>>> This would work similarly:
>>>>>>
>>>>>> InputStream is = ...
>>>>>> Iterable<ProteinSequence> seqs = SeqIO.parse(is, SeqIO.FASTQ_SANGER, SeqIO.PROTEIN);
>>>>>>
>>>>>>
>>>>>> It looks like the other sequence readers aren't as clear-cut, so they
>>>>>> may need a bit more wrapping before they can be adapted to this
>>>>>> method. A common problem is that sequence readers don't return a
>>>>>> specific type of sequence, like with
>>>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
>>>>>> just contains the sequence data in itself. We might want to create
>>>>>> methods that convert the UniprotProxySequenceReader into sequences
>>>>>> that make more sense, like DNASequence and ProteinSequence.
>>>>>>
>>>>>> I'll look into this more later, I have to go to class.
>>>>>>
>>>>>> Regards,
>>>>>> David
>>>>>>
>>>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> so far it still feels like a wrapper for what is already there. Try to
>>>>>>> take it to the next level. Why does the user still need to provide the
>>>>>>> type of file, can't this be auto-detected? What is the behaviour for
>>>>>>> non-fasta files, what can be supported and where are the limits, etc.
>>>>>>>
>>>>>>> Andreas
>>>>>>>
>>>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com> wrote:
>>>>>>> > I've actually been working on something like this for my GSoC proposal,
>>>>>>> > here's what I came up with:
>>>>>>> >
>>>>>>> > public class SeqIO {
>>>>>>> >    public static final int FASTA = 0;
>>>>>>> >    public static final int FASTQ = 1;
>>>>>>> >    public static final Class<DNASequence> DNA = DNASequence.class;
>>>>>>> >    public static final Class<ProteinSequence> PROTEIN = ProteinSequence.class;
>>>>>>> >
>>>>>>> >    @SuppressWarnings("unchecked")
>>>>>>> >    public static <S extends Sequence> Iterable<S> parse(InputStream is,
>>>>>>> >            int fileFormat, Class<S> seqType) throws Exception {
>>>>>>> >        switch (fileFormat) {
>>>>>>> >            case FASTA:
>>>>>>> >                if (seqType == DNA) {
>>>>>>> >                    // the helper returns a map keyed by ID, so iterate over its values
>>>>>>> >                    return (Iterable<S>) FastaReaderHelper.readFastaDNASequence(is).values();
>>>>>>> >                } else if (seqType == PROTEIN) {
>>>>>>> >                    // etc...
>>>>>>> >                }
>>>>>>> >                break;
>>>>>>> >            case FASTQ:
>>>>>>> >                // etc...
>>>>>>> >                break;
>>>>>>> >        }
>>>>>>> >        // no reader matched the requested format and sequence type
>>>>>>> >        throw new IllegalArgumentException("Unsupported format or sequence type");
>>>>>>> >    }
>>>>>>> > }
>>>>>>> >
>>>>>>> > It would be used like so:
>>>>>>> >
>>>>>>> > InputStream is = ...
>>>>>>> > Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA, SeqIO.DNA);
>>>>>>> > for (DNASequence s : seqs) {
>>>>>>> >   // do something
>>>>>>> > }
>>>>>>> >
>>>>>>> > Obviously it's not the prettiest and a lot could be changed, but that's my
>>>>>>> > initial design. I tried to base it off BioPython's SeqIO, but static typing
>>>>>>> > and the variety of Sequence types forced me to put in some nasty generics.
>>>>>>> > Any tips would be appreciated!
>>>>>>> >
>>>>>>> > David
>>>>>>> >
>>>>>>> > On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
>>>>>>> > biojava at hannes.oib.com> wrote:
>>>>>>> >
>>>>>>> >> Yes, something like a simplifying and unifying wrapper would be what I
>>>>>>> >> am thinking of.
>>>>>>> >>
>>>>>>> >> Hannes
>>>>>>> >>
>>>>>>> >> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> >> > Hi Hannes,
>>>>>>> >> >
>>>>>>> >> > I guess this is pretty similar to:
>>>>>>> >> >
>>>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
>>>>>>> >> >
>>>>>>> >> > we have also been using "proxy" objects to fetch sequence data on the fly
>>>>>>> >> >
>>>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
>>>>>>> >> >
>>>>>>> >> > As such I think we should discuss this a bit more. If we can find a
>>>>>>> >> > common api that is simple and works with both local files as well as
>>>>>>> >> > remote proxy objects, that would be nice. There should be no need to
>>>>>>> >> > change much of the existing code, but perhaps there could be a
>>>>>>> >> > simplified wrapper for what is already there.
>>>>>>> >> >
>>>>>>> >> >  Andreas
>>>>>>> >> >
>>>>>>> >> > On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
>>>>>>> >> > <biojava at hannes.oib.com> wrote:
>>>>>>> >> >> Hi,
>>>>>>> >> >>
>>>>>>> >> >> I browsed around in the sister projects Biopython and Bioperl a bit,
>>>>>>> >> >> and noticed that much of the user interaction with the code goes
>>>>>>> >> >> through classes like SeqIO, SearchIO, AlignIO...
>>>>>>> >> >>
>>>>>>> >> >> So that got me thinking: how about we create similar "Interface"
>>>>>>> >> >> classes in Biojava?
>>>>>>> >> >>
>>>>>>> >> >> PROS:
>>>>>>> >> >>
>>>>>>> >> >>  - easy change for programmers who switch languages
>>>>>>> >> >>  - easy base interface that can be used in cookbook examples
>>>>>>> >> >>  - makes code more readable if designed properly
>>>>>>> >> >>  - easy access to features that are spread over the whole codebase but
>>>>>>> >> >> are connected anyway, like all file parsers
>>>>>>> >> >>
>>>>>>> >> >> CONS:
>>>>>>> >> >>
>>>>>>> >> >>  - another thing to maintain
>>>>>>> >> >>  - creates possible cross-dependencies (but if you don't want that,
>>>>>>> >> >> just use the existing classes directly)
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> What are your thoughts?
>>>>>>> >> >>
>>>>>>> >> >> python from http://biopython.org/wiki/SeqIO:
>>>>>>> >> >>
>>>>>>> >> >> from Bio import SeqIO
>>>>>>> >> >> handle = open("example.fasta", "rU")
>>>>>>> >> >> for record in SeqIO.parse(handle, "fasta") :
>>>>>>> >> >>    print record.id
>>>>>>> >> >> handle.close()
>>>>>>> >> >>
>>>>>>> >> >> possible equivalent in biojava (support for streaming API, Iterators, etc?):
>>>>>>> >> >>
>>>>>>> >> >> import org.biojava3.util.SeqIO;
>>>>>>> >> >>
>>>>>>> >> >> File file = new File("example.fasta");
>>>>>>> >> >> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
>>>>>>> >> >> while (seqIO.hasNext()) {
>>>>>>> >> >>    System.out.println(seqIO.next());
>>>>>>> >> >> }
>>>>>>> >> >> seqIO.close(); // java.io.File has no close(); the reader holds the open stream
>>>>>>> >> >>
>>>>>>> >> >> Hannes
>>>>>>> >> >> _______________________________________________
>>>>>>> >> >> biojava-dev mailing list
>>>>>>> >> >> biojava-dev at lists.open-bio.org
>>>>>>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > --
>>>>>>> >> > -----------------------------------------------------------------------
>>>>>>> >> > Dr. Andreas Prlic
>>>>>>> >> > Senior Scientist, RCSB PDB Protein Data Bank
>>>>>>> >> > University of California, San Diego
>>>>>>> >> > (+1) 858.246.0526
>>>>>>> >> > -----------------------------------------------------------------------



