[Bioperl-l] GenBank Flat File Parser

Hilmar Lapp hlapp@gnf.org
Mon, 9 Sep 2002 17:18:18 -0700


Hi John,

Sounds interesting. Do you have a rough comparison with the Bioperl 
GenBank parser in terms of both features and speed? If it's much 
faster, or if it's event-based, would you consider integrating your 
parser into the Bioperl framework?
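
For concreteness, here is a minimal sketch of what an event-based
(callback-driven) flat file reader could look like. It is not the code
of either parser: each_genbank_entry, the hash keys, and the sample
record are all made up for illustration, and it only picks out the
LOCUS date, the DEFINITION, and the VERSION accession.version.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GenBankParser or Bioperl): read GenBank
# entries from a filehandle and call $callback once per completed entry.
# Returns the number of entries handed to the callback.
sub each_genbank_entry {
    my ($fh, $callback) = @_;
    my %entry;
    my $in_definition = 0;
    my $count         = 0;

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^LOCUS\s/) {
            # The date is the last whitespace-separated token on the LOCUS line.
            %entry = (date => (split ' ', $line)[-1]);
            $in_definition = 0;
        }
        elsif ($line =~ /^DEFINITION\s+(.*)$/) {
            $entry{definition} = $1;
            $in_definition = 1;
        }
        elsif ($in_definition && $line =~ /^\s+(\S.*)$/) {
            # Indented line directly after DEFINITION: a continuation.
            $entry{definition} .= " $1";
        }
        elsif ($line =~ /^VERSION\s+(\S+?)\.(\d+)/) {
            @entry{qw(accession version)} = ($1, $2);
            $in_definition = 0;
        }
        elsif ($line =~ m{^//}) {
            # End-of-entry marker: fire the "event" with a copy of the fields.
            if (%entry) {
                $callback->({ %entry });
                $count++;
            }
            %entry = ();
            $in_definition = 0;
        }
        else {
            $in_definition = 0;
        }
    }
    return $count;
}

# Made-up sample record, just enough to exercise the sketch.
my $flat_file = <<'END';
LOCUS       AB000001     12 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Example sequence, first line
            continued here.
ACCESSION   AB000001
VERSION     AB000001.1  GI:1234567
ORIGIN
        1 acgtacgtac gt
//
END

open my $fh, '<', \$flat_file or die "cannot open in-memory file: $!";
each_genbank_entry($fh, sub {
    my $entry = shift;
    print join("\t", "$entry->{accession}.$entry->{version}",
                     $entry->{date}, $entry->{definition}), "\n";
});
```

The point of the callback style is that only one entry is held in
memory at a time, which matters for multi-gigabyte flat files.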

	-hilmar

On Monday, September 9, 2002, at 04:50  PM, John Kloss wrote:

> I don't know if this is the right place to announce this.
>
> I've created a generic GenBank Flat File parser.  It's modular, easily
> modified (to keep up with the changing flat file format), and pretty
> fast.  Hopefully it's easy to use.  I have to parse flat files a lot so
> the code is useful to me.  Hopefully, it's useful to you.
>
> It's available at
>
> 	sapiens.wustl.edu/~jkloss/GenBankParser.tar.gz
>
> It's a module.  Just
>
> 	gzip -cd GenBankParser.tar.gz | tar xvf -
> 	cd GenBankParser
> 	perl Makefile.PL
> 	make
> 	make install
>
> I have two sample programs at
>
> 	sapiens.wustl.edu/~jkloss/gb2fasta.txt
> 	sapiens.wustl.edu/~jkloss/gt2fasta.txt
>
> One parses the flat file and writes nucleotide FASTA, the other
> protein FASTA.
>
> As a quick example of its use (though the perldoc info gives a lot of
> examples), if I want to parse out the DEFINITION field, the VERSION
> field's accession.version, and the LOCUS date field for each entry I
> would code
>
> 	use GenBankParser qw( DEFINITION LOCUS VERSION );
>
> 	my $Parser = GenBankParser->new;
>
> 	$Parser->parse_file( \*STDIN, sub {
> 		my $parser = shift;
>
> 		print $parser->VERSION->accession,	"\n";
> 		print $parser->VERSION->version,	"\n";
> 		print $parser->LOCUS->date,		"\n";
> 		print $parser->DEFINITION,		"\n";
> 	});
>
> That's about it.
>
> The parser parses the FEATURES table, too.  So if I wanted to parse out
> the translation, accession, version, and note fields for every CDS in an
> entry, except those which are pseudogenes, I would code
>
> 	use GenBankParser qw( FEATURES );
> 	use GenBankParser::FEATURES qw( CDS );
>
> 	GenBankParser->new->parse_file( \*STDIN, sub {
> 		my $parser = shift;
>
> 		foreach my $cds ( @{ $parser->FEATURES->CDS } ) {
>
> 			next if $cds->pseudo;
>
> 			print $cds->accession,	"\n";
> 			print $cds->version,	"\n";
> 			print $cds->note,		"\n";
> 			print $cds->translation,"\n";
> 		}
> 	});
>
> Anyway, if you find it useful, let me know.
>
> 	John Kloss <jkloss@sapiens.wustl.edu>
> 	Systems Admin., Database Admin., Programmer.
>
> 	Gish Lab, Genome Sequencing Center
> 	Washington University Medical School ... in St. Louis
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------