[Biojava-l] Tr: Retrieve Information from GenBank file

Richard Holland holland at eaglegenomics.com
Wed Oct 27 13:16:56 UTC 2010


Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up, in a kind of hash map. Things like protein and product are likely to be found there. Each feature (from getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs, meanwhile, can be retrieved with getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs() on the sequence.

This section of the BioJavaX docs goes into great detail about where every single part of the GenBank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2
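
As a rough cross-check of what the annotation sets should end up containing, the feature qualifiers can also be read straight from the flat file. The following is a minimal sketch in plain Java, not BioJava's parser: the class GenBankQualifierSketch and its methods are hypothetical names, the sample record in main() is made up, and it assumes the usual GenBank layout in which qualifiers start with "/" at column 22 of the FEATURES table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava: collects /key="value" qualifiers
// from the FEATURES table of a single GenBank flat-file record.
class GenBankQualifierSketch {

    // Returns qualifier name -> all values seen, in file order.
    static Map<String, List<String>> qualifiers(String record) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        boolean inFeatures = false;
        String key = null;           // qualifier currently being read
        StringBuilder value = null;  // its value, possibly spanning several lines
        for (String line : record.split("\r?\n")) {
            if (line.startsWith("FEATURES")) { inFeatures = true; continue; }
            if (!inFeatures) continue;
            // The first unindented line (ORIGIN, //, ...) ends the feature table.
            if (!line.startsWith(" ")) break;
            String text = line.trim();
            // Assumption: qualifier lines have their first 21 columns blank.
            boolean qualifierColumn =
                    line.length() > 21 && line.substring(0, 21).trim().isEmpty();
            if (qualifierColumn && text.startsWith("/")) {
                store(found, key, value);
                int eq = text.indexOf('=');
                key = eq < 0 ? text.substring(1) : text.substring(1, eq);
                value = new StringBuilder(eq < 0 ? "" : text.substring(eq + 1));
            } else if (qualifierColumn && key != null) {
                // Continuation of a wrapped value; joined with a space, which
                // suits text qualifiers but not /translation.
                value.append(' ').append(text);
            } else {
                store(found, key, value);  // a new feature key such as "CDS"
                key = null;
            }
        }
        store(found, key, value);
        return found;
    }

    // Saves the pending qualifier, dropping surrounding double quotes.
    private static void store(Map<String, List<String>> found,
                              String key, StringBuilder value) {
        if (key == null) return;
        String v = value.toString().trim();
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
            v = v.substring(1, v.length() - 1);
        }
        found.computeIfAbsent(key, k -> new ArrayList<>()).add(v);
    }

    public static void main(String[] args) {
        String pad = String.format("%21s", "");  // qualifiers start at column 22
        String record = String.join("\n",
                "FEATURES             Location/Qualifiers",
                "     Protein         1..120",
                pad + "/product=\"made-up example protein\"",
                "     CDS             1..120",
                pad + "/db_xref=\"GI:0000000\"",
                "ORIGIN");
        System.out.println(qualifiers(record));
    }
}
```

The sketch lumps the qualifiers of all features together and does not handle a wrapped value whose continuation line happens to begin with "/". It is only meant for comparing against what the RichSequence annotations report.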

cheers,
Richard

On 27 Oct 2010, at 14:03, jc.lucky wrote:

> 
> I'm more interested in the features (regarding protein-ID, taxon, xref, product) and in retrieving information about articles (authors, title). I am not looking at the sequence data at all.
> My purpose is to read the GenBank file and retrieve that information so that I can convert it to a semantic RDF format file. I'm working on a specific gene at the moment, but it would be interesting to extend this to any GenBank file in the future.
> 
> Thanks,
> 
> Jean-Charles
> 
> 
> 
>> Message of 27/10/10 12:41
>> From: "Scooter Willis" 
>> To: "jc.lucky" 
>> Cc: "biojava-l lists open-bio org" 
>> Subject: Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>> 
>> Jean-Charles
>> 
>> I have it on my list to do a GenBank parser but haven't had the time. I
>> can't promise anything in the next couple weeks. Can you send some details
>> about what a typical use case is for your purpose? Are you trying to get the
>> sequence data or are you more interested in the features?
>> 
>> Thanks
>> 
>> Scooter
>> 
>> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky  wrote:
>> 
>>> 
>>> I tried once again with the new version of BioJava, but without success.
>>> Any idea or suggestion?
>>> 
>>> Thanks in advance
>>> Regards,
>>> 
>>> Jean-Charles Ferrières
>>> 
>>> 
>>>> Message of 22/10/10 10:11
>>>> From: "jc.lucky"
>>>> To: biojava-l at lists.open-bio.org
>>>> Cc:
>>>> Subject: [Biojava-l] Retrieve Information from GenBank file
>>>> 
>>>> 
>>>> Hi
>>>> 
>>>> I'm trying to convert a GenBank file into an RDF file. The gene of interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945
>>>> 
>>>> With the code below I can read the GenBank file, and I manage to retrieve some information and convert it to RDF format. However, I don't succeed in retrieving some information such as title, protein or product. According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it is possible to do so.
>>>> Please help me find what I am doing wrong or what should be done to achieve my goal.
>>>> 
>>>> // read the GenBank file
>>>> public static RichSequenceIterator readFile(String input,
>>>>         RichSequenceBuilderFactory seqFactory,
>>>>         Namespace ns)
>>>>         throws IOException, NoSuchElementException, BioException
>>>> {
>>>>     ns = null;
>>>>     InputStream stream = new FileInputStream(input);
>>>>     BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
>>>>     RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile, ns);
>>>>     return seqs;
>>>> }
>>>> 
>>>> // Retrieve information and convert it to RDF format
>>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
>>>>         throws IOException, NoSuchElementException, BioException {
>>>>     // create model for the ontology
>>>>     OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
>>>>     OntClass parents;
>>>>     String URI = "http://pbr.wur.nl/#";
>>>>
>>>>     while (rsi.hasNext())
>>>>     {
>>>>         RichSequence seq = rsi.nextRichSequence();
>>>>         String id = seq.getName();
>>>>         parents = model.createClass(URI + id);
>>>>         Set author = seq.getRankedDocRefs(); // code to clean up Set & convert toString
>>>>         String definition = seq.getDescription(); // code to clean up String
>>>>         // Add to model
>>>>         parents.addProperty(DC.description, definition);
>>>>         parents.addProperty(DC.publisher, authors);
>>>>         parents.addComment(taxonomy, "EN");
>>>>         parents.addProperty(DC.type, organism);
>>>>         // print in RDF format
>>>>         model.write(out, "RDF/XML");
>>>>         out.close();
>>>>     }
>>>> }
>>>> 
>>>> 
>>>> Thanks,
>>>> Jean-Charles Ferrières
>>> _____________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/




