[Biojava-l] GenBank parsing

Paolo Pavan paolo.pavan at gmail.com
Wed Jun 3 13:22:44 UTC 2015


Can't you find that information in the "source" feature? Check this list:
List l = sequence.getFeaturesByType("source");

This comes from the fact that in newer versions of the GenBank file format,
"source" is a compulsory feature, and a lot of information has been moved
from the top-level "Features" tag into the qualifiers of the "source" feature.
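
To check which qualifiers the "source" feature of a given file actually carries, independently of any BioJava version, a plain-JDK scan of the flat file can help. The following is only a minimal sketch and does not use BioJava at all: the class name SourceQualifierScan and the method sourceQualifiers are hypothetical, and it assumes the usual flat-file layout (a FEATURES header line, feature keys indented by five spaces, qualifier lines starting with "/").

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava: scans a GenBank flat file as text.
class SourceQualifierScan {

    /** Returns the qualifiers of the first "source" feature, in file order. */
    static Map<String, List<String>> sourceQualifiers(List<String> lines) {
        Map<String, List<String>> quals = new LinkedHashMap<>();
        boolean inFeatures = false;
        boolean inSource = false;
        List<String> current = null; // values of the qualifier being read
        for (String line : lines) {
            if (!inFeatures) {
                // Skip the header until the feature table starts.
                inFeatures = line.startsWith("FEATURES");
                continue;
            }
            if (!line.startsWith(" ")) {
                break; // ORIGIN, CONTIG, ...: the feature table has ended
            }
            // Assumption: feature keys sit after exactly five spaces.
            boolean isFeatureKey = line.length() > 5
                    && line.startsWith("     ") && line.charAt(5) != ' ';
            if (isFeatureKey) {
                if (inSource) {
                    break; // the next feature begins, so "source" is complete
                }
                inSource = line.trim().split("\\s+")[0].equals("source");
                continue;
            }
            if (!inSource) {
                continue;
            }
            String text = line.trim();
            if (text.startsWith("/")) {
                int eq = text.indexOf('=');
                String name = eq < 0 ? text.substring(1) : text.substring(1, eq);
                String value = eq < 0 ? "" : stripQuotes(text.substring(eq + 1));
                current = quals.computeIfAbsent(name, k -> new ArrayList<>());
                current.add(value);
            } else if (current != null) {
                // Continuation line of a multi-line qualifier value.
                int last = current.size() - 1;
                current.set(last, current.get(last) + " " + stripQuotes(text));
            }
        }
        return quals;
    }

    private static String stripQuotes(String s) {
        return s.replaceAll("^\"|\"$", "");
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample record, used when no file path is given.
        List<String> sample = Arrays.asList(
                "LOCUS       EXAMPLE",
                "FEATURES             Location/Qualifiers",
                "     source          1..5028",
                "                     /organism=\"Saccharomyces cerevisiae\"",
                "                     /mol_type=\"genomic DNA\"",
                "                     /db_xref=\"taxon:4932\"",
                "     CDS             <1..206",
                "                     /codon_start=3",
                "ORIGIN");
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0])) : sample;
        System.out.println(sourceQualifiers(lines));
    }
}
```

Run without arguments it prints the qualifiers of the made-up sample record; given a file path as the first argument it scans that file instead. Repeated qualifiers such as db_xref are kept as a list, which is why each name maps to a List<String>.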

Let us know,
Paolo


2015-06-03 14:29 GMT+02:00 simon rayner <simon.rayner.cn at gmail.com>:

> Thanks to all for taking the time to answer.
>
> I had already got as far as parsing out the feature information using
> something like
>
> LinkedHashMap<String, DNASequence> dnaSequences =
>         GenbankReaderHelper.readGenbankDNASequence(dnaFile);
> for (DNASequence sequence : dnaSequences.values()) {
>     List<FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound>> fl =
>             sequence.getFeatures();
>     for (FeatureInterface fi : fl) {
>         HashMap<String, Qualifier> quals = fi.getQualifiers();
>         for (Map.Entry<String, Qualifier> entry : quals.entrySet()) {
>             logger.info("--\t" + entry.getKey() + "\t|\t" + entry.getValue().getName()
>                     + "  /  " + entry.getValue().getValue()
>                     + "\\" + entry.getValue().toString());
>         }
>         logger.info("SHORT\t" + fi.getShortDescription());
>         logger.info("SOURCE\t" + fi.getSource());
>         logger.info("TYPE\t" + fi.getType());
>         logger.info("HASHCODE\t" + fi.hashCode());
>         logger.info("-");
>     }
> }
>
> But I am still stumped as to how to access the annotation information at
> the top of a GenBank file.
>
> For example, getAccession gets me the accession number of the sequence,
> but what about all the other data that is there (e.g. the pubmed records)?
>
> In BJ3, there was a RichAnnotation class, but I don't see anything
> equivalent in BJ4.
>
> cheers
>
> Simon
>
>
>
> On Wed, Jun 3, 2015 at 12:39 PM, Paolo Pavan <paolo.pavan at gmail.com>
> wrote:
>
>> Hi Simon,
>> I took care of the latest updates to the GenBank parser (reader).
>> Currently there are two ways to read annotated GenBank files: via
>> GenbankReader and via GenbankProxySequenceReader.
>>
>> The first one:
>> GenbankReader<ProteinSequence, AminoAcidCompound> GenbankProtein
>>         = new GenbankReader<ProteinSequence, AminoAcidCompound>(
>>                 inStream,
>>                 new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
>>                 new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
>>         );
>> LinkedHashMap<String, ProteinSequence> proteinSequences = GenbankProtein.process();
>> inStream.close();
>>
>>
>> The second one is:
>>
>> GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader
>>         = new GenbankProxySequenceReader<AminoAcidCompound>(
>>                 "/my_directory", "NP_000257",
>>                 AminoAcidCompoundSet.getAminoAcidCompoundSet());
>> ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader);
>>
>>
>> Just remember to use NucleotideCompound and a
>> DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) if you need to parse
>> GenBank nucleotide files.
>>
>> You can access the stored annotation via the getFeatures() family of
>> methods on the sequence object you have read. Note that features have
>> qualifiers (the entries starting with / in the GenBank file), which must be
>> accessed from the feature object with getQualifiers().
>> Also note that a feature can have a complex location (rare, but it does
>> occur); in that case you will find nested locations in the retrieved feature.
>>
>> Does this answer your question?
>> Bye bye,
>> Paolo
>>
>>
>>
>>
>>
>>
>> 2015-06-03 10:27 GMT+02:00 Jose Manuel Duarte <jose.duarte at psi.ch>:
>>
>>> I can't offer much help regarding GenBank parsing itself, but I would at
>>> least like to clarify the situation with the different (indeed confusing)
>>> versions:
>>>
>>> BJ4 is the current release, well maintained and under development. BJ3
>>> has been completely superseded by BJ4. That means that BJ4 does everything
>>> that BJ3 did. In the cookbook and tutorials everything that refers to BJ3
>>> should work in BJ4, with the only difference that the namespace of packages
>>> has changed from org.biojava.bio/org.biojava3 to org.biojava.nbio.
>>>
>>> BJ1 and BJX are both legacy projects, with some maintenance but not much
>>> active development. I believe that some of the features in them were not
>>> ported to BJ3+.
>>>
>>> Cheers
>>>
>>> Jose
>>>
>>>
>>>
>>> On 02.06.2015 11:40, Simon Rayner wrote:
>>>
>>>> Hi
>>>>
>>>> I'm coming back to BioJava (BJ) after a couple of years away and am
>>>> somewhat confused by the current collection of cookbooks, tutorials and
>>>> APIs. There appear to be a few examples for handling protein structure
>>>> data, but relatively little for more mainstream stuff such as parsing
>>>> Genbank files, which I first need to get the information I want to
>>>> investigate protein structure. But when I look at the relevant code samples
>>>> to do this, they refer back to BJ3, BJ1, or even BJX. Even the Wiki page
>>>> still refers to BJ3 despite the release of BJ4 back in Feb 2015.
>>>>
>>>> I have everything working for parsing GenBank data, but I'm still
>>>> trying to get the Annotation information out of the top of a GenBank file,
>>>> and can't find any way of doing this using BJ4 - the BJ4 API appears to
>>>> refer to the RichAnnotation type in BJX release. Can anyone clarify what
>>>> you are supposed to do here? Start mixing in some BJX? (and is BJX still
>>>> active?) or should I still be using BJ3 until BJ4 stabilizes. I realise
>>>> this is an open source project, but some clarification on the current
>>>> status of things would be handy if the project is going to appeal to a
>>>> larger community :)
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at mailman.open-bio.org
>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>
>>
>
>