[Bioperl-l] parsing GenBank file

Torsten Seemann torsten.seemann at infotech.monash.edu.au
Wed May 5 07:48:55 UTC 2010


> I have a huge GenBank file (downloaded from RDP, containing all
> bacterial 16S). I just want to parse the RDP ID (in LOCUS) and the organism's lineage (in ORGANISM).
> I am getting the output like:
> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
> Holophagales Holophagae "Acidobacteria" Bacteria Root
> This is the exact output I want, but I am missing a lot of records (they are
> there in the GenBank file but not in my output).
> I also got a warning during parsing:
> --------------------- WARNING ---------------------
> MSG: Unbalanced quote in:
> /db_xref="taxon:35783" /germline"
> /mol_type="genomic DNA"
> /organism="Enterococcus sp."
> /strain="LMG12316"No further qualifiers will be added for this feature
> ---------------------------------------------------
> So I was wondering: is this warning message causing that problem, or
> am I doing something wrong?

"Unbalanced quote" means the qualifier's value does not contain an even
number of double-quote (") characters, so an opening quote has no matching
closing quote (or vice versa). The first line of the warning looks
problematic:

YOU HAVE:

/db_xref="taxon:35783" /germline"

SHOULD BE:

/db_xref="taxon:35783"
/germline
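
To find every qualifier that would trigger this warning, you can check for an odd number of double quotes yourself. The following is only a rough Python sketch, not Bioperl's own code: the function names are made up for illustration, and it assumes each qualifier has already been joined into a single string, as in the warning text above. Escaped quotes inside GenBank values are written as a doubled "" pair, so they do not change the parity.

```python
def has_balanced_quotes(qualifier):
    """Return True when the qualifier text holds an even number of double quotes."""
    return qualifier.count('"') % 2 == 0


def find_unbalanced(qualifiers):
    """Return the qualifiers whose double quotes do not pair up."""
    return [q for q in qualifiers if not has_balanced_quotes(q)]


# The four qualifiers from the warning message, one string per qualifier.
qualifiers = [
    '/db_xref="taxon:35783" /germline"',
    '/mol_type="genomic DNA"',
    '/organism="Enterococcus sp."',
    '/strain="LMG12316"',
]

for bad in find_unbalanced(qualifiers):
    print("unbalanced:", bad)
```

Only the first string is reported, because it contains three double quotes.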

I suspect the problem lies either with RDP's GenBank producer, or with
how Bioperl handles the "germline" qualifier, which is a 'null valued'
qualifier like /pseudo: it takes no ="value" string. (I think Bioperl
handles this by setting the value to "_no_value"?)

http://www.ncbi.nlm.nih.gov/collab/FT/

Qualifier       /germline
Definition      the sequence presented in the entry has not undergone somatic
                rearrangement as part of an adaptive immune response; it is the
                unrearranged sequence that was inherited from the parental
                germline
Value format    none
Example         /germline
Comment         /germline should not be used to indicate that the source of
                the sequence is a gamete or germ cell;
                /germline and /rearranged cannot be used in the same source
                feature;
                /germline and /rearranged should only be used for molecules that
                can undergo somatic rearrangements as part of an adaptive immune
                response; these are the T-cell receptor (TCR) and immunoglobulin
                loci in the jawed vertebrates, and the unrelated variable
                lymphocyte receptor (VLR) locus in the jawless fish (lampreys
                and hagfish);
                /germline and /rearranged should not be used outside of the
                Craniata (taxid=89593)
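
Since /germline takes no value, one possible workaround is to repair the malformed line before the file reaches the parser. The Python sketch below is only an illustration under a strong assumption: that the raw file really contains the two qualifiers glued onto one line with a stray trailing quote, exactly as the warning prints them. The pattern and the function name are hypothetical, so check a few affected records by eye before relying on it.

```python
import re

# Assumed glitch: a quoted qualifier followed on the same line by a
# value-less qualifier that carries one stray trailing double quote.
GLUED = re.compile(r'^(\s*/\w+="[^"]*")\s+(/\w+)"\s*$')


def split_glued_qualifier(line):
    """Return the repaired line(s); lines without the glitch come back unchanged."""
    match = GLUED.match(line)
    if match is None:
        return [line]
    # Reuse the original indentation for the qualifier that gets its own line.
    indent = re.match(r'\s*', line).group(0)
    return [match.group(1), indent + match.group(2)]


for fixed in split_glued_qualifier('/db_xref="taxon:35783" /germline"'):
    print(fixed)
```

Lines that do not match the assumed pattern pass through untouched, so running this over the whole file should not alter well-formed records.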


--Torsten Seemann
--Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA



