[Bioperl-l] SeqIO::table
Brian Osborne
brian_osborne at cognia.com
Sun Apr 17 11:00:00 EDT 2005
Hilmar,
Yes, this is a good idea, like the existing 'tab' format but with more
information.
Brian O.
-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
Sent: Friday, April 08, 2005 8:15 PM
To: Bioperl
Subject: [Bioperl-l] SeqIO::table
I wrote two new SeqIO-compliant streams that return Bio::Seq objects
from a table: one reads column-delimited ASCII text, the other reads an
Excel worksheet inside an Excel file.
In either format the table is presumed to contain one sequence per line
(or row). The parser lets you identify a few columns with implied
semantic meaning (display_id, accession, species, sequence string). Any
other columns can be selectively preserved in the annotation bundle.
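To make the column mapping concrete, here is a rough sketch in Python
(not the Perl module itself). The function name parse_rows, the dict
record layout, and the sample rows in the usage note are all
hypothetical; only the one-based column indexes and the four semantic
fields come from the description above.

```python
import re

def parse_rows(lines, delim=r"\t", display_id=None, accession_number=None,
               seq=None, species=None):
    """Yield one record (a plain dict) per non-empty table row.

    Column arguments are one-based indexes, as in the description above.
    This is an illustration only, not the Bio::SeqIO::table code.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines (an assumption of this sketch)
        # re.split keeps empty fields, so consecutive delimiters
        # are not collapsed into one.
        cols = re.split(delim, line)

        def col(index):
            # Translate a one-based index into a value; None if unset
            # or if the row is too short.
            if index is None or index > len(cols):
                return None
            return cols[index - 1]

        yield {
            "display_id": col(display_id),
            "accession_number": col(accession_number),
            "seq": col(seq),
            # A non-numeric species value is treated as a static
            # species shared by all records.
            "species": col(species) if isinstance(species, int) else species,
        }
```

With made-up rows such as "GENE1<TAB>ACC001<TAB>Homo sapiens<TAB>MKV",
calling parse_rows(rows, display_id=1, accession_number=2, species=3,
seq=4) yields one dict per row with those four fields filled in.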
The motivation for this was that several comprehensive gene family
publications made their data available in manually curated
spreadsheets. I needed these data as a SeqIO-compliant stream, and
going through an intermediary fasta file can mess up the annotation a
lot.
If anybody else is interested in this, or thinks it could be of general
interest, I'll commit it to bioperl.
I've enclosed the supported arguments for the SeqIO::table::new method;
this will give an idea of what is configurable. The Excel parser
supports the same arguments, plus the name of the worksheet.
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
Named parameters supported by the proposed Bio::SeqIO::table:

  -comment           leading character(s) introducing a comment line

  -header            the number of header lines to skip; the first
                     non-comment header line will be used to obtain
                     column names; column names will be used as the
                     default tags for attaching annotation.

  -delim             the delimiter for columns as a regular expression;
                     consecutive occurrences of the delimiter will
                     not be collapsed.

  -display_id        the one-based index of the column containing
                     the display ID of the sequence

  -accession_number  the one-based index of the column containing
                     the accession number of the sequence

  -seq               the one-based index of the column containing
                     the sequence string of the sequence

  -species           the one-based index of the column containing the
                     species for the sequence record; if not a
                     number, will be used as the static species
                     common to all records

  -annotation        if provided and a scalar, a flag whether or
                     not all additional columns are to be preserved
                     as annotation; the tags used will either be
                     'colX' if there is no column header and where X
                     is the one-based column index, and otherwise the
                     column headers will be used as tags.
                     If a reference to an array, only those columns
                     (one-based index) will be preserved as
                     annotation, tags as before.
                     If a reference to a hash, the keys are one-based
                     column indexes to be preserved, and the values
                     are the tags under which the annotation is to be
                     attached.
                     If not provided or supplied as undef, no
                     additional annotation will be preserved.

  -trim              flag determining whether or not all values should
                     be trimmed of leading and trailing white space
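
The -annotation rules are the most involved of these, so here is one way
to read them, sketched in Python rather than Perl. The function name
annotation_tags and its arguments (n_cols, headers, used) are
hypothetical. Two points are assumptions of the sketch rather than
statements in the list above: "additional" columns are taken to mean
those not already claimed by -display_id, -accession_number, -seq or
-species, and a column with a missing or empty header falls back to
'colX'.

```python
def annotation_tags(annotation, n_cols, headers=None, used=()):
    """Return {one-based column index: tag} for columns kept as annotation.

    annotation: None, a scalar flag, a list of indexes, or a dict
                mapping index -> tag (standing in for undef, scalar,
                array reference and hash reference).
    n_cols:     number of columns in the table.
    headers:    column names from the header line, if any.
    used:       indexes already claimed by display_id, seq, etc.
    """
    def default_tag(i):
        # Use the column header when there is one, otherwise 'colX'.
        if headers and i <= len(headers) and headers[i - 1]:
            return headers[i - 1]
        return f"col{i}"

    if annotation is None:
        return {}                      # nothing preserved
    if isinstance(annotation, dict):
        return dict(annotation)        # explicit index -> tag mapping
    if isinstance(annotation, (list, tuple)):
        return {i: default_tag(i) for i in annotation}
    if not annotation:
        return {}                      # false flag: nothing preserved
    # True flag: keep every column not already used elsewhere.
    return {i: default_tag(i)
            for i in range(1, n_cols + 1) if i not in used}
```

For a four-column table without a header line where columns 1 and 4 hold
the display ID and the sequence, annotation_tags(1, 4, used=(1, 4))
gives {2: 'col2', 3: 'col3'}.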