[Biojava-l] Genbank file parser error

Richard Holland holland at eaglegenomics.com
Thu Jan 29 07:25:10 UTC 2009


Gabrielle Doan posted a solution to this a while back and I believe the
changes have been committed already:

http://www.mail-archive.com/biojava-l@lists.open-bio.org/msg01036.html

How old is the copy of BioJava that you're using? Have you tried
checking out the trunk from Subversion to see if that works?

cheers,
Richard

Mark Schreiber wrote:
> I assume that the downloaded file has the complete sequence in it? Probably
> worth checking that it has the complete sequence block (all 116366104 bp).
> 
> - Mark
> 
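One way to do the check Mark describes without going through BioJava at all is to compare the length on each LOCUS line with the number of bases actually present in the ORIGIN block. The sketch below is only an illustration: the class name GenbankLengthCheck and its check() method are made up for this example, and it assumes the usual flat-file layout (LOCUS line carrying "<n> bp", sequence lines after ORIGIN, record terminated by "//"). A record with no ORIGIN block at all will simply report 0 counted bases.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of BioJava.
class GenbankLengthCheck {

    /** Returns one "locus declaredLength countedLength" entry per record. */
    static List<String> check(BufferedReader in) {
        List<String> report = new ArrayList<String>();
        String locus = null;
        long declared = -1;      // -1 means no "<n> bp" was found on the LOCUS line
        long counted = 0;
        boolean inOrigin = false;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("LOCUS")) {
                    String[] f = line.trim().split("\\s+");
                    locus = f.length > 1 ? f[1] : "?";
                    declared = -1;
                    counted = 0;
                    inOrigin = false;
                    // The declared length is the token just before "bp".
                    for (int i = 2; i + 1 < f.length; i++) {
                        if (f[i + 1].equals("bp")) {
                            declared = Long.parseLong(f[i]);
                            break;
                        }
                    }
                } else if (line.startsWith("ORIGIN")) {
                    inOrigin = true;
                } else if (line.startsWith("//")) {
                    if (locus != null) {
                        report.add(locus + " " + declared + " " + counted);
                    }
                    locus = null;
                    inOrigin = false;
                } else if (inOrigin) {
                    // Count letters only; skips position numbers and spaces.
                    for (int i = 0; i < line.length(); i++) {
                        if (Character.isLetter(line.charAt(i))) {
                            counted++;
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GenbankLengthCheck /tmp/mm_alt_chr2.gbk
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            for (String entry : check(in)) {
                System.out.println(entry);
            }
        }
    }
}
```

If the two numbers agree for NT_039207, the file itself is complete and the truncation is happening in the parser rather than in the download.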
> On Thu, Jan 29, 2009 at 12:51 PM, gang wu <gwu at molbio.mgh.harvard.edu> wrote:
> 
>> Hi Everyone,
>>
>> I have a piece of code that parses a GenBank file and retrieves gene
>> sequences and related information. It works well with sequences from
>> Arabidopsis thaliana, C. elegans and Bos taurus, but it fails on Mus
>> musculus chromosome 2. The contig it fails on is the largest one in my
>> test: contig NT_039207 has 116366104 bp, but the code reports it as cut
>> to 100000020 bp. That puts some gene coordinates out of range. Attached
>> is the code. Can anyone give some suggestions?
>>
>> The Mus musculus GenBank file can be downloaded at:
>> ftp://ftp.ncbi.nih.gov/genomes/M_musculus/CHR_02/mm_alt_chr2.gbk.gz
>>
>> Thanks in advance
>>
>> Gang
>> ==========================================
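>> import java.io.BufferedReader;
>> import java.io.File;
>> import java.io.FileInputStream;
>> import java.io.FileNotFoundException;
>> import java.io.InputStreamReader;
>> import java.util.Iterator;
>> import java.util.NoSuchElementException;
>>
>> import org.biojava.bio.Annotation;
>> import org.biojava.bio.BioException;
>> import org.biojavax.Namespace;
>> import org.biojavax.RichObjectFactory;
>> import org.biojavax.bio.seq.RichFeature;
>> import org.biojavax.bio.seq.RichSequence;
>> import org.biojavax.bio.seq.RichSequenceIterator;
>>
>> public class TestMus {
>>   public void testMusChr2() throws FileNotFoundException,
>>           NoSuchElementException, BioException {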
>>       String fp="/tmp/mm_alt_chr2.gbk";
>>       System.out.println("File: " + fp);
>>       BufferedReader gReader = new BufferedReader(
>>               new InputStreamReader(new FileInputStream(new File(fp))));
>>       Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
>>       RichSequenceIterator seqI =
>>               RichSequence.IOTools.readGenbankDNA(gReader, ns);
>>       while (seqI.hasNext()) {
>>           RichSequence seq = seqI.nextRichSequence();
>>           String organism = seq.getTaxon().getDisplayName();
>>           String accession = seq.getAccession();
>>           String identifier = seq.getIdentifier();
>>           int taxonID = seq.getTaxon().getNCBITaxID();
>>           String division = seq.getDivision();
>>           String seqVersion = "" + seq.getSeqVersion();
>>           int seqLength = seq.length();
>>           String description = seq.getDescription();
>>           System.out.println("Organism: " + organism
>>                   + "\nAccession: " + accession
>>                   + "\nIdentifier: " + identifier
>>                   + "\nTaxonID: " + taxonID
>>                   + "\nDivision: " + division
>>                   + "\nSeqVersion: " + seqVersion
>>                   + "\nLength: " + seqLength);
>>           System.out.println("2041-2101: " + seq.subStr(2041, 2101));
>>           for (Iterator i = seq.features(); i.hasNext();) {
>>               RichFeature f = (RichFeature) i.next();
>>               int rank = f.getRank();
>>               String fType = f.getType();
>>               if (fType.toLowerCase().equals("gene")) {
>>                   int startPos=f.getLocation().getMin();
>>                   int endPos=f.getLocation().getMax();
>>                   int geneLen=endPos-startPos+1;
>>                   String sequence=seq.subStr(startPos, endPos);
>>                   String strand = f.getStrand().getToken() + "";
>>                   Annotation ann = (Annotation) f.getAnnotation();
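>>                   // Prefer locus_tag; fall back to gene only if it is present,
>>                   // since getProperty() throws NoSuchElementException otherwise.
>>                   String geneIdentifier = "";
>>                   if (ann.containsProperty("locus_tag")) {
>>                       geneIdentifier = ann.getProperty("locus_tag") + "";
>>                   } else if (ann.containsProperty("gene")) {
>>                       geneIdentifier = ann.getProperty("gene") + "";
>>                   }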
>>
>>                   String alternativeIdentifiers="";
>>                   try {
>>                       alternativeIdentifiers = (String) ann.getProperty("gene");
>>
>>                   } catch(NoSuchElementException e) {}
>>                   String annotation="";
>>                   System.out.println(rank + "\t" + geneIdentifier + "\t"
>>                           + alternativeIdentifiers + "\t"
>>                           + startPos + "\t" + endPos + "\t" + geneLen
>>                           + "\t" + strand);
>>               }
>>           }
>>       }
>>   }
>>   public static void main(String [] args) throws Exception {
>>       TestMus tm = new TestMus();
>>       tm.testMusChr2();
>>   }
>> }
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> 
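Until the length problem is sorted out, the out-of-range gene coordinates in the quoted code could at least be caught before they reach seq.subStr(). The snippet below is a rough sketch only: GeneRangeCheck and inRange() are hypothetical names, not BioJava API, and it relies on BioJava sequence coordinates being 1-based and inclusive.

```java
// Hypothetical helper for illustration; not part of BioJava.
class GeneRangeCheck {

    /**
     * True when the 1-based, inclusive range [min, max] lies entirely
     * inside a sequence of the given length.
     */
    static boolean inRange(int min, int max, int seqLength) {
        return min >= 1 && min <= max && max <= seqLength;
    }

    public static void main(String[] args) {
        int parsedLength = 100000020;    // length reported by the parser in the quoted message
        int declaredLength = 116366104;  // length of NT_039207 according to the record

        // A feature near the start fits either way.
        System.out.println(inRange(2041, 2101, parsedLength));

        // A made-up feature past the truncation point fits only the declared length.
        System.out.println(inRange(110000000, 110000500, parsedLength));
        System.out.println(inRange(110000000, 110000500, declaredLength));
    }
}
```

In the loop above this would amount to calling inRange(startPos, endPos, seq.length()) before seq.subStr(startPos, endPos) and logging the feature instead of extracting it when the check fails.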

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


