[Biojava-dev] Biojava.util package?

Andy Yates ayates at ebi.ac.uk
Sun Apr 1 17:42:03 UTC 2012


Hi

This is the latest spec for GFF3

http://www.sequenceontology.org/gff3.shtml

All the best,

Andy

Sent from my mobile.

On 1 Apr 2012, at 18:03, "P. Troshin" <to.petr at gmail.com> wrote:

>>> Also, what other parsers are you going to write?
>> I've been looking into the GenBank, Stockholm, CATH, and UniProt XML
>> formats, which are mentioned here:
>> http://biojava.org/wiki/BioJava3_Feature_Request
> 
> These are good suggestions. Could you also have a look at more
> multiple sequence alignment formats, e.g. PIR, PFAM, Stockholm, MSF,
> Clustal? A sequence feature parser for GFF
> (http://www.sanger.ac.uk/resources/software/gff/spec.html) might be
> useful too, as might phylogeny parsers, e.g. a Newick tree file parser.
> As for the GenBank parser, I think we should focus on the XML version of
> the GenBank file, as this is now widely available, and use standard Java
> XML readers for the implementation.
> 
> Regards,
> Peter
> 
>>> Now you need to look at the parsers in BioJava and have an idea of
>>> how you are going to unify them.
>> This is what I was trying to figure out earlier. After looking at
>> BioPython, I think it might be effective to read files into a common
>> sequence class (BioPython uses SeqRecord), and then provide utilities
>> to convert from this sequence to others like DNASequence, RNASequence,
>> and ProteinSequence. This could avoid some of the complexities of the
>> Abstract Factory and Builder patterns that are sometimes used in
>> situations like this. Additionally, it shouldn't be too hard to unify
>> the current parsers under this system. FastaReader and FastaWriter
>> already have interfaces that make it easy to extend their functionality,
>> so they won't be a problem. FastqReader already does something like
>> what I'm proposing, so it shouldn't be too difficult to adapt either.
>> The others seem to be somewhat different, so I'll have to examine them
>> more closely.
>> 
>> David
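
A minimal sketch of the idea above: parse every format into one format-neutral record, then convert to a typed sequence on demand. All names here (SeqRecord, DnaSeq, SeqConverters) are hypothetical stand-ins and not BioJava classes; the real DNASequence and ProteinSequence are deliberately left out so that the example compiles on its own.

```java
import java.util.Locale;

// Hypothetical format-neutral record, loosely modelled on BioPython's SeqRecord.
final class SeqRecord {
    final String id;
    final String residues;

    SeqRecord(String id, String residues) {
        this.id = id;
        this.residues = residues;
    }
}

// Hypothetical stand-in for a typed DNA sequence (not BioJava's DNASequence).
final class DnaSeq {
    final String id;
    final String bases;

    DnaSeq(String id, String bases) {
        this.id = id;
        this.bases = bases;
    }
}

// Converters live outside the parsers, so each parser only has to produce SeqRecord.
final class SeqConverters {
    private SeqConverters() {}

    // Upper-cases the residues and rejects anything outside A, C, G, T, N.
    static DnaSeq toDna(SeqRecord record) {
        String bases = record.residues.toUpperCase(Locale.ROOT);
        if (!bases.matches("[ACGTN]*")) {
            throw new IllegalArgumentException("Not a DNA sequence: " + record.id);
        }
        return new DnaSeq(record.id, bases);
    }
}
```

With this split, adding a new file format means writing one parser that emits SeqRecord, and adding a new sequence type means writing one converter; neither side needs to know about the other.
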
>> 
>> On Sat, Mar 31, 2012 at 7:38 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>>> Does this look like a fair list?
>>> 
>>> Yes, your important feature list makes a lot of sense, though I do not
>>> think any of the other features do (yes, your code needs to have
>>> sensible defaults, but also custom functions for more specific cases).
>>> Now you need to look at the parsers in BioJava and have an idea of how
>>> you are going to unify them.
>>> Also, what other parsers are you going to write?
>>> We are slipping into implementation here, but Java is an OO language:
>>> you can store a FASTA sequence in a Map, but it is not going to
>>> be as flexible as a custom object.
>>> It may be a semantic difference, but it is an important one: it is the
>>> difference between a good API and a bad API, between code that is easy
>>> to use and code that is not. David, how much experience do you have with Java?
>>> 
>>> Regards,
>>> Peter
>>> 
>>> 
>>> On 31 March 2012 18:16, David Felty <davfelty at gmail.com> wrote:
>>>> I've been looking at the file parsers for BioPython and BioPerl, and
>>>> here are some features I've compiled:
>>>> Important features:
>>>> - Conversion between file formats
>>>> - Lazy IO; useful for large files
>>>> - Use Iterable interface so we get Java foreach over sequences
>>>> - Index sequences by ID (turn a list of sequences to a map from ID -> seq)
>>>> - Fetching from remote databases
>>>> 
>>>> Other features:
>>>> - Restrict fields needed to speed up parsing; see
>>>> http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
>>>> - Auto-detect file format (use file extension)
>>>> - General-purpose API with sensible defaults for most cases, and a
>>>> more specific but complex API for more control
>>>> - Index sequences by a user-defined value
>>>> - Store indexed database files locally (BioPython stores as a SQLite database)
>>>> 
>>>> Does this look like a fair list? I tried to look for common use cases
>>>> in BioJava's tutorial, but I only found this page, which comes from
>>>> BioJava 1.8: http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
>>>> Are there any other useful sources I could look at? Or perhaps even
>>>> some real-world code that makes use of parsers?
>>>> 
>>>> Thanks,
>>>> David
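
Two items from the "important" list above, lazy IO through Iterable and indexing by ID, can be sketched together. This is only an illustration under simple assumptions: the class names (FastaEntry, LazyFastaReader, FastaIndex) are hypothetical and not part of BioJava, the ID is taken to be the first word of the header line, and the reader is single-pass.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical record type: the ID is the first word of the header line.
final class FastaEntry {
    final String id;
    final String sequence;

    FastaEntry(String id, String sequence) {
        this.id = id;
        this.sequence = sequence;
    }
}

// Reads one record per next() call instead of loading the whole file.
// Single-pass: iterator() should be called only once, because the stream is consumed.
final class LazyFastaReader implements Iterable<FastaEntry> {
    private final BufferedReader in;

    LazyFastaReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    private String readLine() {
        try {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Iterator<FastaEntry> iterator() {
        return new Iterator<FastaEntry>() {
            // Header of the record that next() will return, or null at end of input.
            private String header = skipToHeader();

            private String skipToHeader() {
                String line;
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) return line;
                }
                return null;
            }

            @Override
            public boolean hasNext() {
                return header != null;
            }

            @Override
            public FastaEntry next() {
                if (header == null) throw new NoSuchElementException();
                String id = header.substring(1).trim().split("\\s+")[0];
                StringBuilder residues = new StringBuilder();
                header = null;
                String line;
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        header = line; // remember the next record's header
                        break;
                    }
                    residues.append(line.trim());
                }
                return new FastaEntry(id, residues.toString());
            }
        };
    }
}

final class FastaIndex {
    private FastaIndex() {}

    // Drains the iterable into an insertion-ordered map from ID to record.
    static Map<String, FastaEntry> byId(Iterable<FastaEntry> entries) {
        Map<String, FastaEntry> index = new LinkedHashMap<>();
        for (FastaEntry entry : entries) {
            if (index.put(entry.id, entry) != null) {
                throw new IllegalArgumentException("Duplicate ID: " + entry.id);
            }
        }
        return index;
    }
}
```

Because the reader is an Iterable, callers get the Java foreach loop for free, and the index is just one consumer of that stream; building it does give up the laziness, since every record ends up in memory.
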
>>>> 
>>>> On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>>> But are there any additional features anyone wants me to consider? Once again, I don't
>>>>>> have the same experience as many of you, so your input is very
>>>>>> helpful!
>>>>> 
>>>>> David, I think this is a pretty good list. Remember, you are getting into
>>>>> something more than just a FASTA parser.
>>>>> 
>>>>>> But here is what I've gathered so far from a combination
>>>>>> of already-existing code and people's responses:
>>>>> 
>>>>> I think this is a very good approach.
>>>>> Look at the existing parsers in BioJava and beyond; the features that
>>>>> are common will be the most important. Less common ones will be useful in
>>>>> some cases but less so in others. Come up with a set of use cases and try
>>>>> using the parsers to achieve them; see how easy (or indeed possible)
>>>>> it is going to be with various parsers. I appreciate this is a lot of
>>>>> work, but this way you'll know by heart what a good parser consists
>>>>> of.
>>>>> You can learn from many implementations to get your own just right.
>>>>> Once you've done this, you are going to be the expert and will be able
>>>>> to come up with a list of features in order of importance that your
>>>>> parser is going to have and have some guesstimate of how long it is
>>>>> going to take you to implement them. Do not hesitate to ask the
>>>>> community if there is something you cannot get your head around.
>>>>> 
>>>>> Good luck,
>>>>> Peter
>>>>> 
>>>>> 
>>>>> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
>>>>>>> I'd suggest you step aside from the
>>>>>>> details of implementation. Think about what features your parser(s)
>>>>>>> must have and when and how you are going to achieve them
>>>>>> 
>>>>>> Thank you for this! I now realize that I've been concentrating too
>>>>>> much on the implementation rather than the features. The
>>>>>> implementation will be important when (or if) I actually work on the
>>>>>> project during GSoC, but for now, I'll try to focus on features for my
>>>>>> proposal.
>>>>>> 
>>>>>> Unfortunately, I'm not very acquainted with the world of computational
>>>>>> biology, so I can't be sure what features would be most useful for the
>>>>>> file parsers. But here is what I've gathered so far from a combination
>>>>>> of already-existing code and people's responses:
>>>>>> - Simple API
>>>>>> - Robust
>>>>>> - Extensible
>>>>>> - Good performance
>>>>>> - Feature-rich
>>>>>> - Wide variety of parsers
>>>>>> - Proxy-fetching from remote databases (by ID or location)
>>>>>> - Local caching
>>>>>> - Auto-detection of data type
>>>>>> - Auto-detection of file format
>>>>>> - Lazy IO
>>>>>> - Random access file reading
>>>>>> 
>>>>>> Obviously, these are not all of equal importance, so I'll have to pick
>>>>>> out the most important ones for my proposal. But are there any
>>>>>> additional features anyone wants me to consider? Once again, I don't
>>>>>> have the same experience as many of you, so your input is very
>>>>>> helpful!
>>>>>> 
>>>>>> Thanks,
>>>>>> David
>>>>>> 
>>>>>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>>>>> Hi David,
>>>>>>> 
>>>>>>> Great to see such a discussion! You should see how important your work
>>>>>>> is going to be for the Bio community.
>>>>>>> 
>>>>>>> Now, what you need to do is to try taking into account what other
>>>>>>> people were suggesting and put it into your proposal. It's not going
>>>>>>> to be any good just to add a bunch of opinions; you need to come up
>>>>>>> with a coherent proposal. For this I'd suggest you step aside from the
>>>>>>> details of implementation. Think about what features your parser(s)
>>>>>>> must have and when and how you are going to achieve them.
>>>>>>> I'd suggest that your parsers should be
>>>>>>> - easy to use (IMHO this is something BioJava 1 FASTA parser lacked)
>>>>>>> - robust
>>>>>>> - extensible
>>>>>>> - have good performance
>>>>>>> - most importantly, have a sufficiently rich feature set so that we can
>>>>>>> replace other parsers (for the same format) in BioJava with yours.
>>>>>>> 
>>>>>>> Do not forget to split your work into several achievable stages.
>>>>>>> 
>>>>>>> I'd be careful about transferring the design from Python and
>>>>>>> especially a decade-old Perl implementation straight to Java. While
>>>>>>> high-level concepts may be similar, implementation details should
>>>>>>> not be. It's not that there is anything wrong with these parsers; it
>>>>>>> is just that the languages are different. It is good to know how things
>>>>>>> are done elsewhere, but I'd suggest that for a Java implementation you
>>>>>>> should take inspiration from some well-known Java feature, for
>>>>>>> example the Java Collections, a set of highly regarded tools for
>>>>>>> working with various collections of objects. Also do some reading on
>>>>>>> Java enums; your proposed implementation will definitely benefit from
>>>>>>> using them.
>>>>>>> 
>>>>>>> Have fun,
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Peter
>>>>>>> 
>>>>>>> 
>>>>>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
>>>>>>>> Hey Andreas,
>>>>>>>> 
>>>>>>>> It wouldn't be too difficult to make a method that can infer the
>>>>>>>> file type using the file extension. In fact, it looks like BioPerl's
>>>>>>>> SeqIO does something like this. On the other hand, BioPython's SeqIO
>>>>>>>> takes the route of "explicit is better than implicit," and requires
>>>>>>>> that you explicitly give the format. Perhaps BioJava could take both
>>>>>>>> routes, and have an overloaded parse method that infers the file type,
>>>>>>>> along with the regular explicit method.
>>>>>>>> 
>>>>>>>> As for non-fasta files, I implemented a couple of FASTQ parsers here:
>>>>>>>> http://pastebin.com/KLcpq8Qb
>>>>>>>> This would work similarly:
>>>>>>>> 
>>>>>>>> InputStream is = ...
>>>>>>>> ProteinSequence seq = SeqIO.parse(is, SeqIO.FASTQ_SANGER, SeqIO.PROTEIN);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> It looks like the other sequence readers aren't as clear-cut, so they
>>>>>>>> may need a bit more wrapping before they can be adapted to this
>>>>>>>> method. A common problem is that sequence readers don't return a
>>>>>>>> specific type of sequence, like with
>>>>>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
>>>>>>>> just contains the sequence data in itself. We might want to create
>>>>>>>> methods that convert the UniprotProxySequenceReader into sequences
>>>>>>>> that make more sense, like DNASequence and ProteinSequence.
>>>>>>>> 
>>>>>>>> I'll look into this more later, I have to go to class.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> David
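
The "both routes" idea above (infer the format from the extension, but let an explicit argument override it) could look roughly like the sketch below. SeqFormat and FormatResolver are hypothetical names, and the extension lists are only common conventions that I am assuming here; nothing in this block is an existing BioJava API.

```java
import java.util.Locale;

// Hypothetical format enum; the extension lists are assumed conventions.
enum SeqFormat {
    FASTA("fasta", "fa", "fna", "faa"),
    FASTQ("fastq", "fq"),
    GENBANK("gb", "gbk");

    private final String[] extensions;

    SeqFormat(String... extensions) {
        this.extensions = extensions;
    }

    // Infers the format from the text after the last dot, ignoring case.
    static SeqFormat fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (SeqFormat format : values()) {
            for (String known : format.extensions) {
                if (known.equals(ext)) return format;
            }
        }
        throw new IllegalArgumentException("Unknown file extension: " + ext);
    }
}

final class FormatResolver {
    private FormatResolver() {}

    // Implicit route: guess the format from the file name.
    static SeqFormat resolve(String fileName) {
        return SeqFormat.fromFileName(fileName);
    }

    // Explicit route: the caller's choice always wins over the file name.
    static SeqFormat resolve(String fileName, SeqFormat explicit) {
        return explicit;
    }
}
```

An enum also replaces the int constants with something the compiler can check. Note that this sketch fails loudly on names it cannot classify, for example compressed files such as reads.fastq.gz, rather than guessing.
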
>>>>>>>> 
>>>>>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>> 
>>>>>>>>> Hi David,
>>>>>>>>> 
>>>>>>>>> so far it still feels like a wrapper for what is already there. Try to
>>>>>>>>> take it to the next level. Why does the user still need to provide the
>>>>>>>>> type of file, can't this be auto-detected? What is the behaviour for
>>>>>>>>> non-fasta files, what can be supported and where are the limits, etc.
>>>>>>>>> 
>>>>>>>>> Andreas
>>>>>>>>> 
>>>>>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com> wrote:
>>>>>>>>>> I've actually been working on something like this for my GSoC proposal,
>>>>>>>>>> here's what I came up with:
>>>>>>>>>> 
>>>>>>>>>> public class SeqIO {
>>>>>>>>>>    public static final int FASTA = 0;
>>>>>>>>>>    public static final int FASTQ = 1;
>>>>>>>>>>    public static final Class<DNASequence> DNA = DNASequence.class;
>>>>>>>>>>    public static final Class<ProteinSequence> PROTEIN = ProteinSequence.class;
>>>>>>>>>> 
>>>>>>>>>>    @SuppressWarnings("unchecked")
>>>>>>>>>>    public static <S extends Sequence> Iterable<S> parse(InputStream is,
>>>>>>>>>>            int fileFormat, Class<S> seqType) throws Exception {
>>>>>>>>>>        switch (fileFormat) {
>>>>>>>>>>            case FASTA:
>>>>>>>>>>                if (seqType == DNA) {
>>>>>>>>>>                    // The helper returns a map keyed by ID; iterate over its values.
>>>>>>>>>>                    return (Iterable<S>) FastaReaderHelper.readFastaDNASequence(is).values();
>>>>>>>>>>                } else if (seqType == PROTEIN) {
>>>>>>>>>>                    // etc...
>>>>>>>>>>                }
>>>>>>>>>>                break;
>>>>>>>>>>            case FASTQ:
>>>>>>>>>>                // etc...
>>>>>>>>>>        }
>>>>>>>>>>        // Without this, the method does not compile: not every path returns.
>>>>>>>>>>        throw new IllegalArgumentException("Unsupported format or sequence type");
>>>>>>>>>>    }
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> It would be used like so:
>>>>>>>>>> 
>>>>>>>>>> InputStream is = ...
>>>>>>>>>> Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA, SeqIO.DNA);
>>>>>>>>>> for (DNASequence s : seqs) {
>>>>>>>>>>   // do something
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> Obviously it's not the prettiest and a lot could be changed, but that's my
>>>>>>>>>> initial design. I tried to base it off BioPython's SeqIO, but static typing
>>>>>>>>>> and the variety of Sequence types forced me to put in some nasty generics.
>>>>>>>>>> Any tips would be appreciated!
>>>>>>>>>> 
>>>>>>>>>> David
>>>>>>>>>> 
>>>>>>>>>> On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
>>>>>>>>>> biojava at hannes.oib.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Yes, something like a simplifying and unifying wrapper would be what I
>>>>>>>>>>> am thinking of.
>>>>>>>>>>> 
>>>>>>>>>>> Hannes
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>>>>> Hi Hannes,
>>>>>>>>>>>> 
>>>>>>>>>>>> I guess this is pretty similar to:
>>>>>>>>>>>> 
>>>>>>>>>>>> http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
>>>>>>>>>>>> 
>>>>>>>>>>>> we have also been using "proxy" objects to fetch sequence data on the fly
>>>>>>>>>>>> 
>>>>>>>>>>>> http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
>>>>>>>>>>>> 
>>>>>>>>>>>> As such I think we should discuss this a bit more. If we can find a
>>>>>>>>>>>> common api that is simple and works with both local files as well as
>>>>>>>>>>>> remote proxy objects, that would be nice. There should be no need to
>>>>>>>>>>>> change much of the existing code, but perhaps there could be a
>>>>>>>>>>>> simplified wrapper for what is already there.
>>>>>>>>>>>> 
>>>>>>>>>>>>  Andreas
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
>>>>>>>>>>>> <biojava at hannes.oib.com> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I browsed around in the sister projects Biopython and Bioperl a bit,
>>>>>>>>>>>>> and noticed that much of the user interaction with the code goes
>>>>>>>>>>>>> through classes like SeqIO, SearchIO, AlignIO...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So that got me thinking: how about we create similar "Interface"
>>>>>>>>>>>>> classes in Biojava?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PROS:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  - easy change for programmers who switch languages
>>>>>>>>>>>>>  - easy base interface that can be used in cookbook examples
>>>>>>>>>>>>>  - makes code more readable if designed properly
>>>>>>>>>>>>>  - easy access to features that are spread over the whole codebase but
>>>>>>>>>>>>> are connected anyway, like all file parsers
>>>>>>>>>>>>> 
>>>>>>>>>>>>> CONS:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  - another thing to maintain
>>>>>>>>>>>>>  - creates possible cross-dependencies (but if you don't want that,
>>>>>>>>>>>>> just use the existing classes directly)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What are your thoughts?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> python from http://biopython.org/wiki/SeqIO:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> from Bio import SeqIO
>>>>>>>>>>>>> handle = open("example.fasta", "rU")
>>>>>>>>>>>>> for record in SeqIO.parse(handle, "fasta") :
>>>>>>>>>>>>>    print record.id
>>>>>>>>>>>>> handle.close()
>>>>>>>>>>>>> 
>>>>>>>>>>>>> possible equivalent in biojava (support for streaming API, Iterators,
>>>>>>>>>>>>> etc?):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> import org.biojava3.util.SeqIO;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> File file = new File("example.fasta");
>>>>>>>>>>>>> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
>>>>>>>>>>>>> while (seqIO.hasNext()) {
>>>>>>>>>>>>>    System.out.println(seqIO.next());
>>>>>>>>>>>>> }
>>>>>>>>>>>>> seqIO.close();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hannes
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> biojava-dev mailing list
>>>>>>>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> -----------------------------------------------------------------------
>>>>>>>>>>>> Dr. Andreas Prlic
>>>>>>>>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>>>>>>>>> University of California, San Diego
>>>>>>>>>>>> (+1) 858.246.0526
>>>>>>>>>>>> -----------------------------------------------------------------------
>>>>>>>>>>> 



