[EMBOSS] FW: Reducing a FASTA repository, new user
Marvin Stodolsky
marvin.stodolsky at gmail.com
Fri Feb 18 02:23:15 UTC 2011
Sorry,
Here is the attachment.
The whole cleanup process could be done with pm;y SED calls I'm sure,
but would be beyond my SED comfort level.
MarvS
On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller <kellert at ohsu.edu> wrote:
> HI Martin,
> I am interested i the solution. There was no attachment to the email I received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
>
>
>
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> All thanks for the suggestions. A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository. This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>>
>> Specifically, is the command using ENTRET or relatives , to accept a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job.
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>
>>>> This is elementary I’m sure, but I’ve been unable to work out the
>>>> syntax from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>> -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>> -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
>
-------------- next part --------------
With respect to using info in the FASTA description field, the intent and partial solution can now be explained.
The top level intent is to avoid overlapping genes, in a statiscal analysis being pl anned.
The 3rd & 4th lines below from an "infoseq -nocolumns" whole genome retreival. They report an overlap, i.e.,
the DNA gyrase A is overlapped by seryl-tRNA: serly_begin=7294 - 7322=gyrase_end < 0
DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]
In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the proteins/codon_usage statistical analysis being planned. Thus I wished an enmass way of recogizing the overlapping genes.
A non-elegant fix has been worked out.
Pulling the dataset into a spreadsheet, spaces in the description field were next replaced with >< :
DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37]
thymidylate 8551..9183(+)><[Mycoplasma><genitalium><G37]
Next ><[ is replace by "to be field seperator" |[
DNA><polymerase><III,><beta><subunit><686..1828(+)|[Mycoplasma><genitalium><G37]
DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37]
and the file saved as: Myc637000176m.csv
to get rid of >< in the terminal common [Mycoplasma><genitalium><G37], there was done
$ cut -d"[" -f1 Myc637000176m.csv > Myc637000176m2.csv
resulting in :
DNA><polymerase><III,><beta><subunit><686..1828(+)|
DnaJ><domain><protein><1828..2760(+)|
DNA><gyrase><subunit><B><2845..4797(+)|
DNA><gyrase><subunit><A><4812..7322(+)|
seryl-tRNA><synthetase><7294..8547(+)|
internals are next mostly deleted with:
sed -e 's/<.*>//g' Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|
The single remmaining >< is replaced with potential separator |
sed -e 's/></|/g' Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|
BASICALLY, the clever work is now done, and the rest is more routine manipulation.
A cleanup was done with:
sed -e 's/)|//g' Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g' Myc637000176m5.csv > Myc637000176m6.csv
together changing the (+)| to |+ ,that is a separated field
The replacement of the residual .. with potential separator | was easiest done as a within spreadsheet operation in its own field, because of too many residual "." in the whole file
After routine manipulations within the spread sheet,
a view of the overlap detection section is:
F G H I J fields
Start End Begin-nextEnd OR((H2<0),(H1<0)) Stable 0/1 Value, for SORTING on
686 1828 0 FALSE 0
1828 2760 85 FALSE 0
2845 4797 15 TRUE 0
4812 7322 -28 TRUE 1
7294 8547 4 TRUE 1
8551 9183 -27 TRUE 1
9156 9920 3 FALSE 0
9923 11251 0 FALSE 0
The overlapping genes have stable value 1,during sorting, while field I FALSE/TRUE and not stable during SORTing
More information about the EMBOSS
mailing list