[Biojava-l] Genbank file parser error

gwu gwu at molbio.mgh.harvard.edu
Thu Jan 29 18:40:06 UTC 2009


Thanks, Mark. I extracted the sequence block with sed, and its length
agrees with what the GenBank record says.
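
For anyone who wants to repeat that check without sed, here is a minimal sketch in plain Java. It is not part of BioJava; the class name GenBankLengthCheck and its methods are made up for illustration. It assumes each record has a LOCUS line whose third whitespace-separated field is the length, an ORIGIN block holding the bases, and a closing "//" line. It counts the letters in each ORIGIN block so the count can be compared with the declared length.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a BioJava class): declared vs. counted sequence length. */
final class GenBankLengthCheck {

    /** Lengths found for one record: LOCUS-declared and counted from ORIGIN. */
    static final class Result {
        final String locus;
        final long declared;
        final long counted;

        Result(String locus, long declared, long counted) {
            this.locus = locus;
            this.declared = declared;
            this.counted = counted;
        }
    }

    /** Scans GenBank flat-file text and returns one Result per record. */
    static List<Result> check(Reader in) {
        List<Result> results = new ArrayList<>();
        String locus = null;
        long declared = -1;
        long counted = 0;
        boolean inOrigin = false;
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("LOCUS")) {
                    // Assumed layout: LOCUS <name> <length> bp ...
                    String[] f = line.trim().split("\\s+");
                    locus = f.length > 1 ? f[1] : "?";
                    declared = f.length > 2 ? Long.parseLong(f[2]) : -1;
                    counted = 0;
                    inOrigin = false;
                } else if (line.startsWith("ORIGIN")) {
                    inOrigin = true;
                } else if (line.startsWith("//")) {
                    // End of record: store what was seen and reset.
                    if (locus != null) {
                        results.add(new Result(locus, declared, counted));
                    }
                    locus = null;
                    inOrigin = false;
                } else if (inOrigin) {
                    // Sequence lines mix position numbers, spaces and bases;
                    // only the letters are bases.
                    for (int i = 0; i < line.length(); i++) {
                        if (Character.isLetter(line.charAt(i))) {
                            counted++;
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java GenBankLengthCheck <file.gbk>");
            return;
        }
        for (Result r : check(new FileReader(args[0]))) {
            System.out.println(r.locus + "\tdeclared=" + r.declared
                    + "\tcounted=" + r.counted
                    + (r.declared == r.counted ? "" : "\tMISMATCH"));
        }
    }
}
```

Running it on the uncompressed file (for example /tmp/mm_alt_chr2.gbk) prints one line per record. If declared and counted agree for NT_039207, the file itself is complete and the truncation happens somewhere in parsing.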

Gang

Mark Schreiber wrote:
> I assume that the downloaded file has the complete sequence in it? 
> Probably worth checking that it has the complete sequence block (all 
> 116366104 bp).
>  
> - Mark
>
> On Thu, Jan 29, 2009 at 12:51 PM, gang wu <gwu at molbio.mgh.harvard.edu 
> <mailto:gwu at molbio.mgh.harvard.edu>> wrote:
>
>     Hi Everyone,
>
>     I have a piece of code that parses a GenBank file and retrieves
>     gene sequences and related information. It works well with
>     sequences from Arabidopsis thaliana, C. elegans, and Bos taurus,
>     but it fails on Mus musculus chromosome 2. The contig it fails on
>     is the largest one in my test: contig NT_039207 has 116366104 bp,
>     but the code reports it as cut to 100000020 bp, which puts some
>     gene coordinates out of range. The code is attached below. Can
>     anyone offer a suggestion?
>
>     The Mus musculus GenBank file can be downloaded at:
>     ftp://ftp.ncbi.nih.gov/genomes/M_musculus/CHR_02/mm_alt_chr2.gbk.gz
>
>     Thanks in advance
>
>     Gang
>     ==========================================
>     import java.io.BufferedReader;
>     import java.io.File;
>     import java.io.FileInputStream;
>     import java.io.FileNotFoundException;
>     import java.io.InputStreamReader;
>     import java.util.Iterator;
>     import java.util.NoSuchElementException;
>
>     import org.biojava.bio.Annotation;
>     import org.biojava.bio.BioException;
>     import org.biojavax.Namespace;
>     import org.biojavax.RichObjectFactory;
>     import org.biojavax.bio.seq.RichFeature;
>     import org.biojavax.bio.seq.RichSequence;
>     import org.biojavax.bio.seq.RichSequenceIterator;
>
>     public class TestMus {
>       public void testMusChr2() throws FileNotFoundException,
>               NoSuchElementException, BioException {
>           String fp = "/tmp/mm_alt_chr2.gbk";
>           System.out.println("File: " + fp);
>           BufferedReader gReader = new BufferedReader(
>                   new InputStreamReader(new FileInputStream(new File(fp))));
>           Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
>           RichSequenceIterator seqI =
>                   RichSequence.IOTools.readGenbankDNA(gReader, ns);
>           while (seqI.hasNext()) {
>               // Record-level information
>               RichSequence seq = seqI.nextRichSequence();
>               String organism = seq.getTaxon().getDisplayName();
>               String accession = seq.getAccession();
>               String identifier = seq.getIdentifier();
>               int taxonID = seq.getTaxon().getNCBITaxID();
>               String division = seq.getDivision();
>               String seqVersion = "" + seq.getSeqVersion();
>               int seqLength = seq.length();
>               String description = seq.getDescription();
>               System.out.println("Organism: " + organism
>                       + "\nAccession: " + accession
>                       + "\nIdentifier: " + identifier
>                       + "\nTaxonID: " + taxonID
>                       + "\nDivision: " + division
>                       + "\nSeqVersion: " + seqVersion
>                       + "\nLength: " + seqLength);
>               System.out.println("2041-2101: " + seq.subStr(2041, 2101));
>               // Print one tab-separated line per "gene" feature
>               for (Iterator i = seq.features(); i.hasNext();) {
>                   RichFeature f = (RichFeature) i.next();
>                   int rank = f.getRank();
>                   String fType = f.getType();
>                   if (fType.equalsIgnoreCase("gene")) {
>                       int startPos = f.getLocation().getMin();
>                       int endPos = f.getLocation().getMax();
>                       int geneLen = endPos - startPos + 1;
>                       // Fails when endPos lies beyond the parsed length
>                       String sequence = seq.subStr(startPos, endPos);
>                       String strand = f.getStrand().getToken() + "";
>                       Annotation ann = (Annotation) f.getAnnotation();
>                       // Prefer locus_tag as the identifier, else use gene
>                       String geneIdentifier = "";
>                       if (ann.containsProperty("locus_tag")) {
>                           geneIdentifier = ann.getProperty("locus_tag") + "";
>                       } else {
>                           geneIdentifier = ann.getProperty("gene") + "";
>                       }
>
>                       String alternativeIdentifiers = "";
>                       try {
>                           alternativeIdentifiers =
>                                   (String) ann.getProperty("gene");
>                       } catch (NoSuchElementException e) {
>                           // No "gene" qualifier: leave it empty
>                       }
>                       String annotation = "";
>                       System.out.println(rank + "\t" + geneIdentifier
>                               + "\t" + alternativeIdentifiers + "\t"
>                               + startPos + "\t" + endPos + "\t"
>                               + geneLen + "\t" + strand);
>                   }
>               }
>           }
>       }
>
>       public static void main(String[] args) throws Exception {
>           TestMus tm = new TestMus();
>           tm.testMusChr2();
>       }
>     }
>     _______________________________________________
>     Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>     <mailto:Biojava-l at lists.open-bio.org>
>     http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>