[Biojava-l] Fwd: GenBank parsing

Tue Jun 9 06:49:36 UTC 2015

Sure,

I'm up for that. I'm just looking at the code on GitHub to get a feel
for the project. Then I might be able to contribute better.

Simon

On Thu, Jun 4, 2015 at 2:49 PM, Andreas Prlic <andreas at sdsc.edu> wrote:

> Hi Paolo and Simon,
>
> Would it be possible to update the documentation in the tutorial, so other
> users can see how to retrieve features as well? Also, perhaps file a ticket
> for the missing fields (keywords, etc.), so this does not get lost?
>
>
> https://github.com/biojava/biojava-tutorial/blob/master/genomics/genebank.md
>
> Thanks!
>
> Andreas
>
> On Thu, Jun 4, 2015 at 2:57 AM, simon rayner <simon.rayner.cn at gmail.com>
> wrote:
>
>> We resolved things, sort of, but at some point we fell off the mailing
>> list. Here is the full message chain.
>>
>> thanks again to all for the help
>>
>> Andreas, repeating my question here: would it be any use if I added a
>> more complete code sample to the tutorial showing how to pull the Feature
>> information out of a GenBank file?
>>
>> cheers
>>
>> Simon
>>
>>
>> ---------- Forwarded message ----------
>> From: Paolo Pavan <paolo.pavan at gmail.com>
>> Date: Wed, Jun 3, 2015 at 5:34 PM
>> Subject: Re: [Biojava-l] GenBank parsing
>> To: simon rayner <simon.rayner.cn at gmail.com>
>>
>>
>> Oh, I'm realizing now that we went outside of the mailing list.
>> You can forward all the conversation to the list and ask for Andreas
>> there.
>>
>> Paolo
>>
>> 2015-06-03 17:29 GMT+02:00 Paolo Pavan <paolo.pavan at gmail.com>:
>>
>>> Simon,
>>> As far as I have read on the mailing list, Andreas Prlic is
>>> interested in this kind of collaboration. I think he will answer you
>>> shortly.
>>>
>>> Bye bye!
>>>
>>> 2015-06-03 17:14 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:
>>>
>>>> Hi Paolo
>>>>
>>>> I think it's okay. For now, perhaps it would be good to clarify this
>>>> somewhere (perhaps in the tutorial sample?). And would it be any use if I
>>>> added a more complete code sample to the tutorial showing how to pull the
>>>> Feature information out of a GenBank file?
>>>>
>>>> Simon
>>>>
>>>> On Wed, Jun 3, 2015 at 5:11 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Simon,
>>>>> Now I see what you mean, and unfortunately I must say that retrieving
>>>>> those fields is not supported yet. They aren't in the section I worked
>>>>> on, and I wasn't actually aware of that.
>>>>>
>>>>> The file responsible for this behaviour is GenbankSequenceParser.java.
>>>>> I don't know whether any of the original authors are out there and
>>>>> can add something.
>>>>>
>>>>> You are unlucky. Let me know if I can be of any more help.
>>>>> Paolo
>>>>>
>>>>> 2015-06-03 15:55 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:
>>>>>
>>>>>> Hi Paolo
>>>>>>
>>>>>>  sequence.getFeaturesByType("source");
>>>>>>
>>>>>> will return the 'source' entry at the top of the FEATURE tree, but it
>>>>>> won't help me retrieve anything outside the FEATURE tree (from the top of
>>>>>> the file and at the bottom before the sequence)
>>>>>>
>>>>>> For example, in the following GenBank file
>>>>>>
>>>>>> LOCUS       AY102993                 400 bp    mRNA    linear   VRL 22-FEB-2006
>>>>>> DEFINITION  Rabies virus isolate RV61 nucleoprotein mRNA, partial cds.
>>>>>> ACCESSION   AY102993 AY247649
>>>>>> VERSION     AY102993.2  GI:34099643
>>>>>> KEYWORDS    .
>>>>>> SOURCE      Rabies virus
>>>>>>   ORGANISM  Rabies virus
>>>>>>             Viruses; ssRNA viruses; ssRNA negative-strand viruses;
>>>>>>             Mononegavirales; Rhabdoviridae; Lyssavirus.
>>>>>> REFERENCE   1  (bases 1 to 400)
>>>>>>   AUTHORS   Smith,J., McElhinney,L., Parsons,G., Brink,N., Doherty,T.,
>>>>>>             Agranoff,D., Miranda,M.E. and Fooks,A.R.
>>>>>>   TITLE     Case report: rapid ante-mortem diagnosis of a human case of rabies
>>>>>>             imported into the UK from the Philippines
>>>>>>   JOURNAL   J. Med. Virol. 69 (1), 150-155 (2003)
>>>>>>    PUBMED   12436491
>>>>>> REFERENCE   2  (bases 1 to 400)
>>>>>>
>>>>>>      .
>>>>>>      .
>>>>>>      .
>>>>>>
>>>>>> COMMENT     On Aug 22, 2003 this sequence version replaced gi:25986720.
>>>>>> FEATURES             Location/Qualifiers
>>>>>>      source          1..400
>>>>>>                      /organism="Rabies virus"
>>>>>>                      /mol_type="mRNA"
>>>>>>                      /isolate="RV61"
>>>>>>                      /host="Homo sapiens"
>>>>>>                      /db_xref="taxon:11292"
>>>>>>                      /country="United Kingdom"
>>>>>>                      /note="isolated in 1987"
>>>>>>      CDS             1..>400
>>>>>>
>>>>>>
>>>>>> sequence.getFeaturesByType("source");
>>>>>>
>>>>>> will return the portion
>>>>>>
>>>>>>      source          1..400
>>>>>>                      /organism="Rabies virus"
>>>>>>                      /mol_type="mRNA"
>>>>>>                      /isolate="RV61"
>>>>>>                      /host="Homo sapiens"
>>>>>>                      /db_xref="taxon:11292"
>>>>>>                      /country="United Kingdom"
>>>>>>                      /note="isolated in 1987"
>>>>>>
>>>>>>
>>>>>>
>>>>>> which is important data, but what about the KEYWORDS, SOURCE and
>>>>>> REFERENCE information at the top, and the COMMENT at the bottom?
>>>>>>
>>>>>> I can use the following calls to get some information
>>>>>>
>>>>>> getOriginalHeader() -> LOCUS
>>>>>> getDescription() -> DEFINITION
>>>>>> getAccession() -> ACCESSION
>>>>>>
>>>>>> What am I missing here?
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> Simon
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 3:22 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Can't you find that information in the "source" feature? Check this
>>>>>>> list:
>>>>>>> List l = sequence.getFeaturesByType("source");
>>>>>>>
>>>>>>> This comes from the fact that in newer versions of the GenBank file
>>>>>>> format, source is a compulsory feature, and many details were moved from
>>>>>>> the top-level "Features" tag into the "source" tag qualifiers.
>>>>>>>
>>>>>>> Let us know,
>>>>>>> Paolo
>>>>>>>
>>>>>>>
>>>>>>> 2015-06-03 14:29 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:
>>>>>>>
>>>>>>>> Thanks to all for taking the time to answer.
>>>>>>>>
>>>>>>>> I had already got as far as parsing out the feature information
>>>>>>>> using something like
>>>>>>>>
>>>>>>>> LinkedHashMap<String, DNASequence> dnaSequences =
>>>>>>>>         GenbankReaderHelper.readGenbankDNASequence(dnaFile);
>>>>>>>> for (DNASequence sequence : dnaSequences.values()) {
>>>>>>>>     List<FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound>> fl =
>>>>>>>>             sequence.getFeatures();
>>>>>>>>     for (FeatureInterface fi : fl) {
>>>>>>>>         HashMap<String, Qualifier> quals = fi.getQualifiers();
>>>>>>>>         for (Map.Entry<String, Qualifier> entry : quals.entrySet()) {
>>>>>>>>             logger.info("--\t" + entry.getKey() + "\t|\t" + entry.getValue().getName()
>>>>>>>>                     + "  /  " + entry.getValue().getValue() + "\\" + entry.getValue().toString());
>>>>>>>>         }
>>>>>>>>         logger.info("SHORT\t" + fi.getShortDescription());
>>>>>>>>         logger.info("SOURCE\t" + fi.getSource());
>>>>>>>>         logger.info("TYPE\t" + fi.getType());
>>>>>>>>         logger.info("HASHCODE\t" + fi.hashCode());
>>>>>>>>         logger.info("-");
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> But I am still stumped as to how to access the annotation
>>>>>>>> information at the top of a GenBank file.
>>>>>>>>
>>>>>>>> For example, getAccession gets me the accession number of the
>>>>>>>> sequence, but what about all the other data that is there (e.g. the pubmed
>>>>>>>> records)?
>>>>>>>>
>>>>>>>> In BJ3, there was a RichAnnotation class, but I don't see anything
>>>>>>>> equivalent in BJ4.
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> Simon
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 3, 2015 at 12:39 PM, Paolo Pavan <paolo.pavan at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Simon,
>>>>>>>>> I took care of the latest updates to the GenBank parser (reader).
>>>>>>>>> Currently, there are two ways to read annotated GenBank files: via
>>>>>>>>> GenbankReader and via GenbankProxySequenceReader.
>>>>>>>>>
>>>>>>>>> The first one:
>>>>>>>>> GenbankReader<ProteinSequence, AminoAcidCompound> GenbankProtein
>>>>>>>>>         = new GenbankReader<ProteinSequence, AminoAcidCompound>(
>>>>>>>>>                 inStream,
>>>>>>>>>                 new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
>>>>>>>>>                 new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
>>>>>>>>>         );
>>>>>>>>> LinkedHashMap<String, ProteinSequence> proteinSequences = GenbankProtein.process();
>>>>>>>>> inStream.close();
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The second one is:
>>>>>>>>>
>>>>>>>>> GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader
>>>>>>>>>         = new GenbankProxySequenceReader<AminoAcidCompound>(
>>>>>>>>>                 "/my_directory", "NP_000257",
>>>>>>>>>                 AminoAcidCompoundSet.getAminoAcidCompoundSet());
>>>>>>>>> ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just keep in mind to use NucleotideCompound and a
>>>>>>>>> DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) if you need to parse
>>>>>>>>> genbank nucleotide files.
>>>>>>>>>
>>>>>>>>> You can access the stored annotation via the getFeatures() family of
>>>>>>>>> methods on the sequence object that was read. Also note that features
>>>>>>>>> have qualifiers (those starting with / in the GenBank file), and they
>>>>>>>>> must be accessed from the feature object with getQualifiers().
>>>>>>>>> Also note that a feature can have a complex location (rare, but
>>>>>>>>> present); in this case you will find nested locations in the feature
>>>>>>>>> retrieved.
>>>>>>>>>
>>>>>>>>> Does this answer your question?
>>>>>>>>> Bye bye,
>>>>>>>>> Paolo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2015-06-03 10:27 GMT+02:00 Jose Manuel Duarte <jose.duarte at psi.ch>:
>>>>>>>>>
>>>>>>>>>> I can't offer much help regarding GenBank parsing itself, but I
>>>>>>>>>> would at least like to clarify the situation with the different (indeed
>>>>>>>>>> confusing) versions:
>>>>>>>>>>
>>>>>>>>>> BJ4 is the current release, well maintained and under
>>>>>>>>>> development. BJ3 has been completely superseded by BJ4. That means that BJ4
>>>>>>>>>> does everything that BJ3 did. In the cookbook and tutorials everything that
>>>>>>>>>> refers to BJ3 should work in BJ4, with the only difference that the
>>>>>>>>>> namespace of packages has changed from org.biojava.bio/org.biojava3 to
>>>>>>>>>> org.biojava.nbio.
>>>>>>>>>>
>>>>>>>>>> BJ1 and BJX are both legacy projects, with some maintenance but
>>>>>>>>>> not much active development. I believe that some of the features in them
>>>>>>>>>> were not ported to BJ3+.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> Jose
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 02.06.2015 11:40, Simon Rayner wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I'm coming back to BioJava (BJ) after a couple of years away and
>>>>>>>>>>> am somewhat confused by the current collection of cookbooks, tutorials and
>>>>>>>>>>> APIs. There appear to be a few examples for handling protein structure
>>>>>>>>>>> data, but relatively little for more mainstream stuff such as parsing
>>>>>>>>>>> Genbank files, which I first need to get the information I want to
>>>>>>>>>>> investigate protein structure. But when I look at the relevant code samples
>>>>>>>>>>> to do this, they refer back to BJ3, BJ1, or even BJX. Even the Wiki page
>>>>>>>>>>> still refers to BJ3 despite the release of BJ4 back in Feb 2015.
>>>>>>>>>>>
>>>>>>>>>>> I have everything working for parsing GenBank data, but I'm
>>>>>>>>>>> still trying to get the Annotation information out of the top of a GenBank
>>>>>>>>>>> file, and can't find any way of doing this using BJ4 - the BJ4 API appears
>>>>>>>>>>> to refer to the RichAnnotation type in the BJX release. Can anyone clarify
>>>>>>>>>>> what you are supposed to do here? Start mixing in some BJX? (And is BJX
>>>>>>>>>>> still active?) Or should I still be using BJ3 until BJ4 stabilizes? I realise
>>>>>>>>>>> this is an open source project, but some clarification on the current
>>>>>>>>>>> status of things would be handy if the project is going to appeal to a
>>>>>>>>>>> larger community :)
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Biojava-l mailing list  -  Biojava-l at mailman.open-bio.org
>>>>>>>>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
>
>
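Since the thread concludes that BJ4's GenbankSequenceParser did not yet expose the KEYWORDS, SOURCE, REFERENCE and COMMENT sections, one possible stopgap is to read those header sections directly from the flat file. The following is only a minimal plain-Java sketch, not BioJava code: the class name GenbankHeaderSketch and the method parseHeader are hypothetical, and it assumes the usual flat-file layout in which top-level keywords start in the first column, continuation and sub-keyword lines are indented, and the header ends at the FEATURES line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava: collects the header sections
// that precede the FEATURES table of a single GenBank flat-file record.
final class GenbankHeaderSketch {

    /**
     * Returns the top-level header sections keyed by keyword (LOCUS, KEYWORDS,
     * SOURCE, REFERENCE, COMMENT, ...). Each occurrence of a keyword becomes
     * one list entry, so repeated REFERENCE blocks stay separate.
     */
    static Map<String, List<String>> parseHeader(String record) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String currentKey = null;
        StringBuilder currentText = new StringBuilder();

        for (String line : record.split("\\r?\\n")) {
            // Assumption: the header ends where the feature table, the
            // sequence block, or the record terminator begins.
            if (line.startsWith("FEATURES") || line.startsWith("ORIGIN") || line.startsWith("//")) {
                break;
            }
            if (line.trim().isEmpty()) {
                continue;
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                // A line starting in the first column opens a new section.
                flush(sections, currentKey, currentText);
                String[] parts = line.split("\\s+", 2);
                currentKey = parts[0];
                currentText = new StringBuilder(parts.length > 1 ? parts[1].trim() : "");
            } else if (currentKey != null) {
                // Indented lines (continuations, ORGANISM, AUTHORS, PUBMED, ...)
                // are appended to the section that is currently open.
                if (currentText.length() > 0) {
                    currentText.append('\n');
                }
                currentText.append(line.trim());
            }
        }
        flush(sections, currentKey, currentText);
        return sections;
    }

    // Stores the section collected so far, if any.
    private static void flush(Map<String, List<String>> sections, String key, StringBuilder text) {
        if (key != null) {
            sections.computeIfAbsent(key, k -> new ArrayList<>()).add(text.toString());
        }
    }

    public static void main(String[] args) {
        // Abbreviated version of the record quoted in the thread.
        String record = String.join("\n",
                "LOCUS       AY102993                 400 bp    mRNA    linear   VRL 22-FEB-2006",
                "DEFINITION  Rabies virus isolate RV61 nucleoprotein mRNA, partial cds.",
                "KEYWORDS    .",
                "SOURCE      Rabies virus",
                "  ORGANISM  Rabies virus",
                "REFERENCE   1  (bases 1 to 400)",
                "   PUBMED   12436491",
                "FEATURES             Location/Qualifiers",
                "     source          1..400");
        for (Map.Entry<String, List<String>> e : parseHeader(record).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Repeated keywords such as REFERENCE are kept as separate list entries, and sub-keywords such as ORGANISM or PUBMED remain as lines inside their parent section, so any finer splitting is left to the caller.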