[BioSQL-l] Re: getting exon information from genbank files

Hilmar Lapp hlapp at gnf.org
Tue Apr 12 13:17:55 EDT 2005


Thanks. Help is always appreciated and sample queries will surely be 
helpful to people.

	-hilmar

On Apr 12, 2005, at 2:09 AM, ankit soni wrote:

> Sorry for the confusion. The values were masked; they were not the
> actual values.
> I was later able to figure out how to do what I needed.
> I am developing a few example SQL queries, which I will post to the
> list soon.
>
> Thanks for helping.
> Ankit Soni
>
>
>
> On Mon, 11 Apr 2005 11:55:09 -0700, Hilmar Lapp <hlapp at gnf.org> wrote:
>> Ankit, the values you're showing in your sample record, did you make
>> them up entirely or is this an actual query result?
>>
>> Note that all columns in the location table are numeric, so it only
>> creates confusion if you choose letters as characters to mask the real
>> values. If they are the real values then you must have changed the
>> schema and not used load_seqdatabase.pl to load records.
>>
>> Note also that generally what's in biosql will closely resemble the
>> object model that was built by the bioperl SeqIO parser run on your
>> input record(s), provided you used load_seqdatabase.pl to load the
>> record(s). So what ends up in biosql as the result of loading a
>> genbank record greatly depends on the genbank record itself. As a
>> rule, what the genbank record had in its feature table you'll also
>> find in biosql as a seqfeature record, and what wasn't in the feature
>> table you won't find in biosql either. Introns are usually not
>> annotated explicitly in Genbank; they are only implicit as the regions
>> between exons, so unless the genbank records you loaded were
>> exceptions you won't find introns in biosql either. How to find
>> exons again depends on the feature table of the original records: some
>> have a single cDNA feature with a composite ('split') location, which
>> will end up in biosql as one seqfeature that has many locations
>> attached. Genomic contigs sometimes have the exons annotated as
>> individual features, and then this is what you'll find in biosql too:
>> one seqfeature per exon, each with a single location.
>>
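To illustrate the point above: what a row in the location table stands for can be looked up by joining location to seqfeature and term. The Python sketch below assumes the standard schema, in which seqfeature.type_term_id references term.term_id. It runs against a toy in-memory SQLite database whose tables are cut down to the columns used and whose rows are invented for the example; build_toy_db and feature_locations are hypothetical helper names. Only the SELECT statement is meant to carry over to a real MySQL instance.

```python
import sqlite3

# Feature type plus coordinates for every location of one bioentry.
# Note: recent MySQL versions treat "rank" as a reserved word, so it
# may need to be quoted with backticks there.
FEATURE_LOCATIONS_SQL = """
SELECT sf.seqfeature_id, t.name, l.start_pos, l.end_pos, l.strand, l.rank
FROM seqfeature sf
JOIN term t ON t.term_id = sf.type_term_id
JOIN location l ON l.seqfeature_id = sf.seqfeature_id
WHERE sf.bioentry_id = ?
ORDER BY sf.seqfeature_id, l.rank
"""


def build_toy_db():
    """Create a reduced, made-up copy of three BioSQL tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY,
                                 bioentry_id INTEGER,
                                 type_term_id INTEGER);
        CREATE TABLE location (location_id INTEGER PRIMARY KEY,
                               seqfeature_id INTEGER,
                               start_pos INTEGER, end_pos INTEGER,
                               strand INTEGER, rank INTEGER);
        INSERT INTO term VALUES (1, 'gene'), (2, 'exon'), (3, 'mRNA');
        -- bioentry 1: a gene plus one mRNA feature with a split location
        -- bioentry 2: exons annotated as individual features
        INSERT INTO seqfeature VALUES (10, 1, 1), (11, 1, 3),
                                      (20, 2, 2), (21, 2, 2);
        INSERT INTO location VALUES (1, 10, 100, 900, 1, 1),
                                    (2, 11, 100, 250, 1, 1),
                                    (3, 11, 400, 520, 1, 2),
                                    (4, 11, 700, 900, 1, 3),
                                    (5, 20, 100, 250, 1, 1),
                                    (6, 21, 400, 520, 1, 1);
    """)
    return conn


def feature_locations(conn, bioentry_id):
    """Return (seqfeature_id, type, start, end, strand, rank) tuples."""
    return conn.execute(FEATURE_LOCATIONS_SQL, (bioentry_id,)).fetchall()


if __name__ == "__main__":
    for row in feature_locations(build_toy_db(), 1):
        print(row)
```

In the toy data, bioentry 1 mirrors the "one seqfeature with many locations" case and bioentry 2 the "one seqfeature per exon" case, so the same query covers both layouts.
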
>> The bottom line is, if you load through load_seqdatabase.pl the
>> content in biosql will closely match the object tree in bioperl, which
>> will often be close to the data structure of the original input
>> record. Features that weren't there to begin with you won't find
>> magically added.
>>
>> So, to come back to your question, there is no good answer because it
>> greatly depends on what your input was. Most likely though you'll
>> have to impute introns by fetching the locations of the cDNA (or mRNA)
>> feature or the locations of the exon features, ordering them properly,
>> and then inferring introns between consecutive exons.
>>
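The inference step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: infer_introns is a hypothetical helper, the exon locations are taken to be (start_pos, end_pos) pairs with 1-based inclusive coordinates, and all of them are assumed to lie on the same sequence and strand.

```python
def infer_introns(exons):
    """Infer intron intervals as the gaps between consecutive exons.

    exons: iterable of (start_pos, end_pos) pairs, assumed 1-based and
    inclusive, all on the same sequence and strand. Input order does
    not matter; overlapping or adjacent exons produce no intron.
    """
    introns = []
    covered_end = None  # rightmost position covered by exons so far
    for start, end in sorted(exons):
        if covered_end is not None and start > covered_end + 1:
            # Uncovered positions between two exons form an intron.
            introns.append((covered_end + 1, start - 1))
        covered_end = end if covered_end is None else max(covered_end, end)
    return introns


if __name__ == "__main__":
    # Made-up exon coordinates, deliberately out of order.
    print(infer_introns([(700, 900), (100, 250), (400, 520)]))
```

Sorting by start position makes the result independent of the order in which the locations come back from the database; for minus-strand features the intervals stay the same, only their biological order is reversed.
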
>> If this is what you need to do all the time I'd write a script that
>> does this in an automated fashion against all newly loaded records and
>> inserts the introns as features back into the database.
>>
>> 	-hilmar
>>
>> On Sunday, April 10, 2005, at 11:04  AM, ankit soni wrote:
>>
>>> Hi all,
>>> I have just started using BioSQL for one of my projects and I have
>>> loaded a few genbank files into the MySQL database using BioPerl and
>>> the standard schema.
>>> I wanted to ask how I can get information about the exons and
>>> introns from the database.
>>> If I use the following query I get the start and end positions, but I
>>> am not able to find out what the limits (start_pos and end_pos) stand
>>> for, i.e. gene or exon or intron.
>>> mysql> select * from location where seqfeature_id='XXXX';
>>> +-------------+---------------+-----------+---------+-----------+---------+--------+------+
>>> | location_id | seqfeature_id | dbxref_id | term_id | start_pos | end_pos | strand | rank |
>>> +-------------+---------------+-----------+---------+-----------+---------+--------+------+
>>> |        YYYY |          XXXX |      NULL |    NULL |       ABC |     EFG |      1 |    1 |
>>> +-------------+---------------+-----------+---------+-----------+---------+--------+------+
>>>
>>> It would be very helpful if somebody can guide me.
>>> I am sorry if I am unable to use the correct biological terms as I
>>> know very little of biology.
>>>
>>> Ankit Soni
>>> Junior Undergraduate
>>> Dept. of Computer Science
>>> IIT kanpur
>>> India
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at open-bio.org
>>> http://open-bio.org/mailman/listinfo/biosql-l
>>>
>> -- 
>> -------------------------------------------------------------
>> Hilmar Lapp                            email: lapp at gnf.org
>> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
>> -------------------------------------------------------------
>>
>>
>>
>>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


