[Bioperl-l] GenBank Flat File Parser

John Kloss jkloss@sapiens.wustl.edu
Mon, 9 Sep 2002 16:50:17 -0700


I don't know if this is the right place to announce this.

I've created a generic GenBank Flat File parser.  It's modular, easily
modified (to keep up with the changing flat file format), and pretty
fast.  Hopefully it's easy to use.  I have to parse flat files a lot so
the code is useful to me.  Hopefully, it's useful to you.

It's available at

	sapiens.wustl.edu/~jkloss/GenBankParser.tar.gz

It's a module.  To install it, just

	gzip -cd GenBankParser.tar.gz | tar xvf -
	cd GenBankParser
	perl Makefile.PL
	make
	make install

I have two sample programs at

	sapiens.wustl.edu/~jkloss/gb2fasta.txt
	sapiens.wustl.edu/~jkloss/gt2fasta.txt

Each parses the flat file and prints FASTA format: one nucleotide, the
other protein.

As a quick example of its use (though the perldoc info gives a lot of
examples), if I want to parse out the DEFINITION field, the VERSION
field's accession and version, and the LOCUS date field for each entry,
I would code

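	# Import only the fields to be parsed.
	use GenBankParser qw( DEFINITION LOCUS VERSION );

	my $Parser = GenBankParser->new;

	# The callback runs once per entry and receives the parser,
	# which holds the fields of the entry just read.
	$Parser->parse_file( \*STDIN, sub {
		my $parser = shift;

		print $parser->VERSION->accession,	"\n";
		print $parser->VERSION->version,	"\n";
		print $parser->LOCUS->date,		"\n";
		print $parser->DEFINITION,		"\n";
	});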

That's about it.

The parser handles the FEATURES table, too.  So if I wanted to parse out
the translation, accession, version, and note fields for every CDS in an
entry, except those that are pseudogenes, I would code

	use GenBankParser qw( FEATURES );
	use GenBankParser::FEATURES qw( CDS );

	GenBankParser->new->parse_file( \*STDIN, sub {
		my $parser = shift;

		# FEATURES->CDS is a reference to the entry's list of CDS features.
		foreach my $cds ( @{ $parser->FEATURES->CDS } ) {

			# Skip pseudogenes.
			next if $cds->pseudo;

			print $cds->accession,		"\n";
			print $cds->version,		"\n";
			print $cds->note,		"\n";
			print $cds->translation,	"\n";
		}
	});
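
The values pulled out of a CDS this way are enough to build a protein
FASTA record.  What follows is only a rough sketch and is not taken
from the sample programs: format_fasta is a hypothetical helper, not
part of GenBankParser, and the 60-column default and the stand-in
values are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GenBankParser): build one FASTA record,
# wrapping the sequence at $width characters per line (60 is an assumed default).
sub format_fasta {
	my ( $id, $description, $sequence, $width ) = @_;
	$width ||= 60;

	# Header line: ">id description", or just ">id" without a description.
	my $header = '>' . $id;
	$header .= ' ' . $description
		if defined $description && length $description;

	# Break the sequence into chunks of at most $width characters.
	my @lines = $sequence =~ /(.{1,$width})/g;

	return join( "\n", $header, @lines ) . "\n";
}

# Stand-in values; inside the CDS loop these would come from
# $cds->accession, $cds->version, $cds->note and $cds->translation.
print format_fasta( 'AAA00001.1', 'example note', 'MKTAYIAKQRQISFVKSHFSRQ', 10 );
```

Inside the callback, the id could be put together as
$cds->accession . '.' . $cds->version, assuming the two accessors return
the bare accession and the version number as the example suggests.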

Anyway, if you find it useful, let me know.

	John Kloss <jkloss@sapiens.wustl.edu>
	Systems Admin., Database Admin., Programmer.

	Gish Lab, Genome Sequencing Center
	Washington University Medical School ... in St. Louis