[Biojava-l] Fwd: GenBank parsing

Thu Jun 4 12:49:46 UTC 2015

Hi Paolo and Simon,

Would it be possible to update the documentation in the tutorial, so other
users can see how to retrieve features as well? Also, perhaps file a ticket
for the missing fields (KEYWORDS, etc.), so this does not get lost?

https://github.com/biojava/biojava-tutorial/blob/master/genomics/genebank.md

Thanks!

Andreas

On Thu, Jun 4, 2015 at 2:57 AM, simon rayner <simon.rayner.cn at gmail.com>
wrote:

> We resolved things, sort of, but at some point we fell off the mailing
> list. Here is the full message chain.
>
> thanks again to all for the help
>
> Andreas, repeating my question here: would it be any use if I added a
> more complete code sample to the tutorial to show how to pull the Feature
> information out of a GenBank file?
>
> cheers
>
> Simon
>
>
> ---------- Forwarded message ----------
> From: Paolo Pavan <paolo.pavan at gmail.com>
> Date: Wed, Jun 3, 2015 at 5:34 PM
> Subject: Re: [Biojava-l] GenBank parsing
> To: simon rayner <simon.rayner.cn at gmail.com>
>
>
> Oh, I'm realizing now that we went outside of the mailing list.
> You can forward all the conversation to the list and ask for Andreas there.
>
> Paolo
>
> 2015-06-03 17:29 GMT+02:00 Paolo Pavan <paolo.pavan at gmail.com>:
>
>> Simon,
>> As far as I have read on the mailing list, Andreas Prlic is interested
>> in this kind of collaboration. I think he will answer you shortly.
>>
>> Bye bye!
>>
>> 2015-06-03 17:14 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:
>>
>>> Hi Paolo
>>>
>>> I think it's okay. For now, perhaps it would be good to clarify this
>>> somewhere (perhaps in the tutorial sample?). And would it be any use if I
>>> added a more complete code sample to the tutorial to show how to pull the
>>> Feature information out of a GenBank file?
>>>
>>> Simon
>>>
>>> On Wed, Jun 3, 2015 at 5:11 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>> wrote:
>>>
>>>> Hi Simon,
>>>> Now I see what you mean, and unfortunately I must say that those
>>>> retrievals are not supported yet. They are outside the section I worked
>>>> on, and I wasn't actually aware of the gap.
>>>>
>>>> The file responsible for this behaviour is GenbankSequenceParser.java.
>>>> I don't know whether any of the original authors are out there and can
>>>> add something.
>>>>
>>>> You are out of luck for now; let me know if I can be of any more help.
>>>> Paolo
>>>>
>>>> 2015-06-03 15:55 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:
>>>>
>>>>> Hi Paolo
>>>>>
>>>>>  sequence.getFeaturesByType("source");
>>>>>
>>>>> will return the 'source' entry at the top of the FEATURE tree, but it
>>>>> won't help me retrieve anything outside the FEATURE tree (from the top of
>>>>> the file and at the bottom before the sequence)
>>>>>
>>>>> For example, in the following GenBank file
>>>>>
>>>>> LOCUS       AY102993                 400 bp    mRNA    linear   VRL 22-FEB-2006
>>>>> DEFINITION  Rabies virus isolate RV61 nucleoprotein mRNA, partial cds.
>>>>> ACCESSION   AY102993 AY247649
>>>>> VERSION     AY102993.2  GI:34099643
>>>>> KEYWORDS    .
>>>>> SOURCE      Rabies virus
>>>>>   ORGANISM  Rabies virus
>>>>>             Viruses; ssRNA viruses; ssRNA negative-strand viruses;
>>>>>             Mononegavirales; Rhabdoviridae; Lyssavirus.
>>>>> REFERENCE   1  (bases 1 to 400)
>>>>>   AUTHORS   Smith,J., McElhinney,L., Parsons,G., Brink,N., Doherty,T.,
>>>>>             Agranoff,D., Miranda,M.E. and Fooks,A.R.
>>>>>   TITLE     Case report: rapid ante-mortem diagnosis of a human case of rabies
>>>>>             imported into the UK from the Philippines
>>>>>   JOURNAL   J. Med. Virol. 69 (1), 150-155 (2003)
>>>>>    PUBMED   12436491
>>>>> REFERENCE   2  (bases 1 to 400)
>>>>>
>>>>>      .
>>>>>      .
>>>>>      .
>>>>>
>>>>> COMMENT     On Aug 22, 2003 this sequence version replaced gi:25986720.
>>>>> FEATURES             Location/Qualifiers
>>>>>      source          1..400
>>>>>                      /organism="Rabies virus"
>>>>>                      /mol_type="mRNA"
>>>>>                      /isolate="RV61"
>>>>>                      /host="Homo sapiens"
>>>>>                      /db_xref="taxon:11292"
>>>>>                      /country="United Kingdom"
>>>>>                      /note="isolated in 1987"
>>>>>      CDS             1..>400
>>>>>
>>>>>
>>>>> sequence.getFeaturesByType("source");
>>>>>
>>>>> will return the portion
>>>>>
>>>>>      source          1..400
>>>>>                      /organism="Rabies virus"
>>>>>                      /mol_type="mRNA"
>>>>>                      /isolate="RV61"
>>>>>                      /host="Homo sapiens"
>>>>>                      /db_xref="taxon:11292"
>>>>>                      /country="United Kingdom"
>>>>>                      /note="isolated in 1987"
>>>>>
>>>>>
>>>>>
>>>>> which is important data, but what about the KEYWORDS, SOURCE and
>>>>> REFERENCE information at the top and COMMENT at the bottom?
>>>>>
>>>>> I can use the following calls to get some information
>>>>>
>>>>> getOriginalHeader() -> LOCUS
>>>>> getDescription() -> DEFINITION
>>>>> getAccession() -> ACCESSION
>>>>>
>>>>> What am I missing here?
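Until the parser exposes these fields, one stopgap might be to read the header block straight from the flat-file text. What follows is a minimal sketch in plain Java and not BioJava API: the class name GenbankHeaderSketch and the method readHeader are hypothetical. It assumes the usual flat-file layout, where top-level keywords start in the first column, sub-keywords and continuation lines are indented, and the header ends at FEATURES.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
class GenbankHeaderSketch {

    /**
     * Collects the header sections of one GenBank record (everything
     * before FEATURES), keyed by top-level keyword. A keyword that
     * repeats, such as REFERENCE, gets one list entry per occurrence.
     */
    static Map<String, List<String>> readHeader(String record) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> bucket = null; // entries of the keyword being read
        for (String line : record.split("\\R")) {
            // The header stops where the feature table or sequence starts.
            if (line.startsWith("FEATURES") || line.startsWith("ORIGIN")
                    || line.startsWith("//")) {
                break;
            }
            String text = line.trim().replaceAll("\\s+", " ");
            if (text.isEmpty()) {
                continue;
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                // Top-level keyword in column one: start a new entry.
                int cut = text.indexOf(' ');
                String keyword = cut < 0 ? text : text.substring(0, cut);
                String value = cut < 0 ? "" : text.substring(cut + 1);
                bucket = sections.computeIfAbsent(keyword, k -> new ArrayList<>());
                bucket.add(value);
            } else if (bucket != null) {
                // Indented line: sub-keyword or continuation, folded
                // into the current entry.
                int last = bucket.size() - 1;
                String previous = bucket.get(last);
                bucket.set(last, previous.isEmpty() ? text : previous + " " + text);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String record = String.join("\n",
                "LOCUS       AY102993                 400 bp    mRNA    linear   VRL 22-FEB-2006",
                "KEYWORDS    .",
                "SOURCE      Rabies virus",
                "  ORGANISM  Rabies virus",
                "REFERENCE   1  (bases 1 to 400)",
                "   PUBMED   12436491",
                "REFERENCE   2  (bases 1 to 400)",
                "COMMENT     On Aug 22, 2003 this sequence version replaced gi:25986720.",
                "FEATURES             Location/Qualifiers",
                "     source          1..400");
        readHeader(record).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Each keyword maps to a list because REFERENCE appears once per citation. Sub-keywords such as ORGANISM or PUBMED are folded into the parent entry here and would need a second pass to separate.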
>>>>>
>>>>> thanks
>>>>>
>>>>> Simon
>>>>>
>>>>> On Wed, Jun 3, 2015 at 3:22 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Can't you find that information in the "source" feature? Check this
>>>>>> list:
>>>>>> List l = sequence.getFeaturesByType("source");
>>>>>>
>>>>>> This comes from the fact that in the new version of the GenBank file
>>>>>> format, source is a compulsory feature, and a lot of information was
>>>>>> moved from the top-level "Features" tag into "source" qualifiers.
>>>>>>
>>>>>> Let us know,
>>>>>> Paolo
>>>>>>
>>>>>>
>>>>>> 2015-06-03 14:29 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:
>>>>>>
>>>>>>> Thanks to all for taking the time to answer.
>>>>>>>
>>>>>>> I had already got as far as parsing out the feature information
>>>>>>> using something like
>>>>>>>
>>>>>>> LinkedHashMap<String, DNASequence> dnaSequences =
>>>>>>>         GenbankReaderHelper.readGenbankDNASequence(dnaFile);
>>>>>>> for (DNASequence sequence : dnaSequences.values()) {
>>>>>>>     List<FeatureInterface<AbstractSequence<NucleotideCompound>,
>>>>>>>             NucleotideCompound>> fl = sequence.getFeatures();
>>>>>>>     for (FeatureInterface fi : fl) {
>>>>>>>         HashMap<String, Qualifier> quals = fi.getQualifiers();
>>>>>>>         for (Map.Entry<String, Qualifier> entry : quals.entrySet()) {
>>>>>>>             logger.info("--\t" + entry.getKey()
>>>>>>>                     + "\t|\t" + entry.getValue().getName()
>>>>>>>                     + "  /  " + entry.getValue().getValue()
>>>>>>>                     + "\\" + entry.getValue().toString());
>>>>>>>         }
>>>>>>>         logger.info("SHORT\t" + fi.getShortDescription());
>>>>>>>         logger.info("SOURCE\t" + fi.getSource());
>>>>>>>         logger.info("TYPE\t" + fi.getType());
>>>>>>>         logger.info("HASHCODE\t" + fi.hashCode());
>>>>>>>         logger.info("-");
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> But I am still stumped as to how to access the annotation
>>>>>>> information at the top of a GenBank file.
>>>>>>>
>>>>>>> For example, getAccession gets me the accession number of the
>>>>>>> sequence, but what about all the other data that is there (e.g. the pubmed
>>>>>>> records)?
>>>>>>>
>>>>>>> In BJ3, there was a RichAnnotation class, but I don't see anything
>>>>>>> equivalent in BJ4.
>>>>>>>
>>>>>>> cheers
>>>>>>>
>>>>>>> Simon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 12:39 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>> I took care of the latest updates to the GenBank parser (reader). As
>>>>>>>> things stand, there are two ways to read annotated GenBank files: via
>>>>>>>> GenbankReader and via GenbankProxySequenceReader.
>>>>>>>>
>>>>>>>> The first one:
>>>>>>>> GenbankReader<ProteinSequence, AminoAcidCompound> GenbankProtein =
>>>>>>>>         new GenbankReader<ProteinSequence, AminoAcidCompound>(
>>>>>>>>                 inStream,
>>>>>>>>                 new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
>>>>>>>>                 new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));
>>>>>>>> LinkedHashMap<String, ProteinSequence> proteinSequences =
>>>>>>>>         GenbankProtein.process();
>>>>>>>> inStream.close();
>>>>>>>>
>>>>>>>>
>>>>>>>> The second one is:
>>>>>>>>
>>>>>>>> GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader =
>>>>>>>>         new GenbankProxySequenceReader<AminoAcidCompound>(
>>>>>>>>                 "/my_directory",
>>>>>>>>                 "NP_000257",
>>>>>>>>                 AminoAcidCompoundSet.getAminoAcidCompoundSet());
>>>>>>>> ProteinSequence proteinSequence =
>>>>>>>>         new ProteinSequence(genbankProteinReader);
>>>>>>>>
>>>>>>>>
>>>>>>>> Just remember to use NucleotideCompound and a
>>>>>>>> DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) if you need to
>>>>>>>> parse GenBank nucleotide files.
>>>>>>>>
>>>>>>>> You can access the stored annotation via the getFeatures() family of
>>>>>>>> methods on the sequence object you have read. Also note that features
>>>>>>>> have qualifiers (the entries starting with / in the GenBank file), and
>>>>>>>> they must be accessed from the feature object with getQualifiers().
>>>>>>>> Also note that a feature can have a complex location (rare, but it
>>>>>>>> happens); in this case you will find nested locations in the
>>>>>>>> retrieved feature.
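To make the qualifier layout concrete, here is a minimal plain-Java sketch of how the / lines of a single feature map to key/value pairs. It is an illustration only, not BioJava's parser: the class QualifierSketch and its methods are hypothetical, and it assumes that wrapped values can be rejoined with a single space and that no continuation line itself begins with a slash. Neither assumption holds for every record (a wrapped /translation value, for instance, should be rejoined without spaces).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
class QualifierSketch {

    /**
     * Reads the lines of one feature (key/location line first, then the
     * qualifier lines) and returns the qualifiers keyed by name. A name
     * may repeat (e.g. db_xref), so each key maps to a list of values.
     */
    static Map<String, List<String>> parseQualifiers(List<String> featureLines) {
        Map<String, List<String>> quals = new LinkedHashMap<>();
        List<String> bucket = null; // values of the qualifier being read
        for (String line : featureLines) {
            String text = line.trim();
            if (text.startsWith("/")) {
                // "/key=value", or a bare flag such as "/pseudo".
                int eq = text.indexOf('=');
                String key = eq < 0 ? text.substring(1) : text.substring(1, eq);
                String value = eq < 0 ? "" : text.substring(eq + 1);
                bucket = quals.computeIfAbsent(key, k -> new ArrayList<>());
                bucket.add(value);
            } else if (bucket != null && !text.isEmpty()) {
                // Continuation of a wrapped value (assumed space-joined).
                int last = bucket.size() - 1;
                bucket.set(last, bucket.get(last) + " " + text);
            }
            // Lines before the first qualifier (key and location) are skipped.
        }
        for (List<String> values : quals.values()) {
            values.replaceAll(QualifierSketch::unquote);
        }
        return quals;
    }

    /** Strips one pair of surrounding double quotes, if present. */
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "     source          1..400",
                "                     /organism=\"Rabies virus\"",
                "                     /mol_type=\"mRNA\"",
                "                     /note=\"isolated in 1987\"");
        parseQualifiers(source).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The list-valued map is a deliberate choice in this sketch, since a single feature can carry the same qualifier name more than once.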
>>>>>>>>
>>>>>>>> Does this answer your question?
>>>>>>>> Bye bye,
>>>>>>>> Paolo
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-06-03 10:27 GMT+02:00 Jose Manuel Duarte <jose.duarte at psi.ch>:
>>>>>>>>
>>>>>>>>> I can't offer much help regarding GenBank parsing itself, but I
>>>>>>>>> would at least like to clarify the situation with the different (indeed
>>>>>>>>> confusing) versions:
>>>>>>>>>
>>>>>>>>> BJ4 is the current release, well maintained and under development.
>>>>>>>>> BJ3 has been completely superseded by BJ4. That means that BJ4 does
>>>>>>>>> everything that BJ3 did. In the cookbook and tutorials everything that
>>>>>>>>> refers to BJ3 should work in BJ4, with the only difference that the
>>>>>>>>> namespace of packages has changed from org.biojava.bio/org.biojava3 to
>>>>>>>>> org.biojava.nbio.
>>>>>>>>>
>>>>>>>>> BJ1 and BJX are both legacy projects, with some maintenance but
>>>>>>>>> not much active development. I believe that some of the features in them
>>>>>>>>> were not ported to BJ3+.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> Jose
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02.06.2015 11:40, Simon Rayner wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I'm coming back to BioJava (BJ) after a couple of years away and
>>>>>>>>>> am somewhat confused by the current collection of cookbooks, tutorials and
>>>>>>>>>> APIs. There appear to be a few examples for handling protein structure
>>>>>>>>>> data, but relatively little for more mainstream stuff such as parsing
>>>>>>>>>> Genbank files, which I first need to get the information I want to
>>>>>>>>>> investigate protein structure. But when I look at the relevant code samples
>>>>>>>>>> to do this, they refer back to BJ3, BJ1, or even BJX. Even the Wiki page
>>>>>>>>>> still refers to BJ3 despite the release of BJ4 back in Feb 2015.
>>>>>>>>>>
>>>>>>>>>> I have everything working for parsing GenBank data, but I'm still
>>>>>>>>>> trying to get the Annotation information out of the top of a GenBank file,
>>>>>>>>>> and can't find any way of doing this using BJ4 - the BJ4 API appears to
>>>>>>>>>> refer to the RichAnnotation type in BJX release. Can anyone clarify what
>>>>>>>>>> you are supposed to do here? Start mixing in some BJX? (and is BJX still
>>>>>>>>>> active?) or should I still be using BJ3 until BJ4 stabilizes. I realise
>>>>>>>>>> this is an open source project, but some clarification on the current
>>>>>>>>>> status of things would be handy if the project is going to appeal to a
>>>>>>>>>> larger community :)
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list  -  Biojava-l at mailman.open-bio.org
>>>>>>>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>