[Biojava-l] Tr: Retrieve Information from GenBank file

jc.lucky jc.lucky at laposte.net
Wed Oct 27 13:34:22 UTC 2010


Thanks for your reply. As shown at the bottom of this message, those methods are indeed what I use to try to retrieve as much information as possible. My problem, however, is that they do not return the information I need.
For example, getRankedDocRefs() provides authors and journals but no TITLE, and
getFeatureSet() only provides /organism, /mol_type and /db_xref.
That is why I am asking for help and suggestions on how to fix this "problem".
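
As a cross-check that does not depend on BioJava at all, the missing fields can also be read straight from the flat file. What follows is a minimal plain-Java sketch, not the BioJavaX approach discussed in this thread. The class name GenBankFlatScan and its methods titles() and qualifiers() are made up for illustration, and the code assumes the usual GenBank column layout: TITLE indented by two spaces, continuation lines by twelve, and qualifiers starting with "/" inside the FEATURES table.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: scans GenBank flat-file lines without BioJava.
final class GenBankFlatScan {

    /** Collects every REFERENCE TITLE, joining wrapped continuation lines. */
    static List<String> titles(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            // Continuation lines of a sub-keyword are indented by 12 spaces.
            if (current != null && line.startsWith("            ")) {
                current.append(' ').append(line.trim());
                continue;
            }
            if (current != null) {           // any other line ends the title
                out.add(current.toString());
                current = null;
            }
            if (line.startsWith("  TITLE")) {
                current = new StringBuilder(line.substring(7).trim());
            }
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    /** Maps each qualifier name (organism, product, ...) to all of its values. */
    static Map<String, List<String>> qualifiers(List<String> lines) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        boolean inFeatures = false;
        String key = null;                   // non-null while a quoted value is still open
        StringBuilder value = null;
        for (String line : lines) {
            if (line.startsWith("FEATURES")) {
                inFeatures = true;
                continue;
            }
            if (!inFeatures) {
                continue;
            }
            if (!line.startsWith(" ")) {
                break;                       // next top-level keyword (e.g. ORIGIN)
            }
            String text = line.trim();
            if (key != null) {
                value.append(' ').append(text);   // wrapped value: join with one space
            } else if (text.startsWith("/")) {
                int eq = text.indexOf('=');
                key = eq < 0 ? text.substring(1) : text.substring(1, eq);
                value = new StringBuilder(eq < 0 ? "" : text.substring(eq + 1));
            } else {
                continue;                    // feature key and location line
            }
            // The value is complete unless it opened a quote that has not closed yet.
            boolean open = value.length() > 0 && value.charAt(0) == '"'
                    && (value.length() == 1 || value.charAt(value.length() - 1) != '"');
            if (!open) {
                out.computeIfAbsent(key, k -> new ArrayList<>()).add(unquote(value.toString()));
                key = null;
            }
        }
        return out;
    }

    private static String unquote(String s) {
        boolean quoted = s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
        return quoted ? s.substring(1, s.length() - 1) : s;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a GenBank flat file saved locally
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Titles:     " + titles(lines));
        System.out.println("Qualifiers: " + qualifiers(lines));
    }
}
```

Run on a saved record, this prints the titles and a map from qualifier name to values, which makes it easy to see whether TITLE and /product are present in the file before blaming the parser. The sketch has known limits: wrapped values are joined with a single space (wrong for /translation), doubled quotes inside values are not handled, and qualifiers from all features are merged into one map.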

Best,
Jean-Charles


> Message of 27/10/10 15:17
> From: "Richard Holland"
> To: "jc.lucky"
> Cc: "Scooter Willis", "biojava-l lists open-bio org"
> Subject: Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>
> 
> Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().
> 
> This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2
> 
> cheers,
> Richard
> 
> On 27 Oct 2010, at 14:03, jc.lucky wrote:
> 
> > 
> > I'm more interested in the features (regarding protein-ID, taxon, xref, product) and in retrieving information about articles (authors, title). I am not looking at the sequence data at all.
> > My purpose is to read the GenBank file and retrieve this information so that I can convert it to a semantic RDF file. I'm working on a specific gene at the moment, but it would be interesting to extend this to any GenBank file in the future.
> > 
> > Thanks,
> > 
> > Jean-Charles
> > 
> > 
> > 
> >> Message of 27/10/10 12:41
> >> From: "Scooter Willis"
> >> To: "jc.lucky"
> >> Cc: "biojava-l lists open-bio org"
> >> Subject: Re: [Biojava-l] Tr: Retrieve Information from GenBank file
> >> 
> >> Jean-Charles
> >> 
> >> I have it on my list to do a GenBank parser but haven't had the time. I
> >> can't promise anything in the next couple weeks. Can you send some details
> >> about what a typical use case is for your purpose? Are you trying to get the
> >> sequence data or are you more interested in the features?
> >> 
> >> Thanks
> >> 
> >> Scooter
> >> 
> >> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky wrote:
> >> 
> >>> 
> >>> I tried once again with the new version of BioJava, but without success.
> >>> Any ideas or suggestions?
> >>> 
> >>> Thanks in advance
> >>> Regards,
> >>> 
> >>> Jean-Charles Ferrières
> >>> 
> >>> 
> >>>> Message of 22/10/10 10:11
> >>>> From: "jc.lucky"
> >>>> To: biojava-l at lists.open-bio.org
> >>>> Cc:
> >>>> Subject: [Biojava-l] Retrieve Information from GenBank file
> >>>> 
> >>>> 
> >>>> Hi
> >>>> 
> >>>> I'm trying to convert a GenBank file into an RDF file. The gene of interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945
> >>>> 
> >>>> With the code below I can read the GenBank file, retrieve information, and convert it to RDF format. However, I cannot retrieve some information, such as the title, protein, or product. According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it should be possible to do so.
> >>>> Please help me find what I am doing wrong or what should be done to achieve my goal.
> >>>> 
> >>>> // Read the GenBank file
> >>>> public static RichSequenceIterator readFile(String input,
> >>>>         RichSequenceBuilderFactory seqFactory,
> >>>>         Namespace ns)
> >>>>         throws IOException, NoSuchElementException, BioException
> >>>> {
> >>>>     ns = null; // the namespace passed in by the caller is discarded here
> >>>>     InputStream stream = new FileInputStream(input);
> >>>>     BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
> >>>>     RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile, ns);
> >>>>     return seqs;
> >>>> }
> >>>> 
> >>>> // Retrieve information and convert it to RDF format.
> >>>> // Abridged: taxonomy, organism and out are set up in code not shown here.
> >>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
> >>>>         throws IOException, NoSuchElementException, BioException {
> >>>>     // create model for the ontology
> >>>>     OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
> >>>>     OntClass parents;
> >>>>     String URI = "http://pbr.wur.nl/#";
> >>>> 
> >>>>     while (rsi.hasNext())
> >>>>     {
> >>>>         RichSequence seq = rsi.nextRichSequence();
> >>>>         String id = seq.getName();
> >>>>         parents = model.createClass(URI + id);
> >>>>         Set authors = seq.getRankedDocRefs(); // code to clean up Set & convert to String
> >>>>         String definition = seq.getDescription(); // code to clean up String
> >>>>         // add to model
> >>>>         parents.addProperty(DC.description, definition);
> >>>>         parents.addProperty(DC.publisher, authors);
> >>>>         parents.addComment(taxonomy, "EN");
> >>>>         parents.addProperty(DC.type, organism);
> >>>>         // print in RDF format
> >>>>         model.write(out, "RDF/XML");
> >>>>         out.close();
> >>>>     }
> >>>> }
> >>>> 
> >>>> 
> >>>> Thanks,
> >>>> Jean-Charles Ferrières
> >>> _____________________________________________
> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> > 
> 
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> 
> 
> 
