[EMBOSS] FW: Reducing a FASTA repository, new user

Fri Feb 18 02:23:15 UTC 2011

Sorry,

Here is the attachment.
The whole cleanup process could be done with only SED calls, I'm sure,
but that would be beyond my SED comfort level.

MarvS

On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller <kellert at ohsu.edu> wrote:
> Hi Martin,
> I am interested in the solution. There was no attachment to the email I received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> Thanks to all for the suggestions. A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository
>> that is a subset of the gene files in a much larger repository. This
>> is desirable before and after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> RNA genes, cryptic viruses, and SINE/LINE genes are among
>> the undesirables to be eliminated.
>>
>> Specifically, is there a command using ENTRET or its relatives that accepts a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job?
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>
>>>>  This is elementary I’m sure, but I’ve been unable to work out the
>>>> syntax  from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>>  -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>>  -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
>
-------------- next part --------------

With respect to using the information in the FASTA description field, the intent and a partial solution can now be explained.
The top-level intent is to avoid overlapping genes in a statistical analysis being planned.
The 3rd and 4th lines below, from an "infoseq -nocolumns" whole-genome retrieval, report an overlap:
DNA gyrase subunit A is overlapped by seryl-tRNA synthetase, since seryl_begin - gyrase_end = 7294 - 7322 = -28 < 0

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]
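
The overlap test just described (the next gene's begin minus the current gene's end is negative) can also be sketched with awk, directly on description lines of this form. This is only an illustration, assuming each line carries one begin..end token and that the lines are already ordered by begin position; the function name find_overlaps is hypothetical.

```shell
# find_overlaps: read description lines on stdin and print, for each adjacent
# pair that overlaps, the (negative) gap followed by both lines, tab-separated.
# Assumption: one token like 4812..7322 per line; lines without one are skipped.
find_overlaps() {
  awk '
    match($0, /[0-9]+\.\.[0-9]+/) {
      # split the begin..end token into its two numbers
      split(substr($0, RSTART, RLENGTH), pos, /\.\./)
      start = pos[1] + 0
      stop  = pos[2] + 0
      # overlap: this begin lies before the previous end
      if (seen && start - prev_stop < 0)
        printf "%d\t%s\t%s\n", start - prev_stop, prev_line, $0
      prev_stop = stop; prev_line = $0; seen = 1
    }'
}
# usage (descriptions.txt is a hypothetical file of lines like those above):
#   find_overlaps < descriptions.txt
```

Only adjacent lines are compared, so a long gene that spans several following genes would be reported once, against its immediate neighbour.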

In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the proteins/codon_usage statistical analysis being planned. Thus I wanted an en masse way of recognizing the overlapping genes.
An inelegant fix has been worked out.

After pulling the dataset into a spreadsheet, spaces in the description field were next replaced with >< :
DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37]
thymidylate><kinase><8551..9183(+)><[Mycoplasma><genitalium><G37]

Next, ><[ is replaced by |[ , where | is the future field separator:
DNA><polymerase><III,><beta><subunit><686..1828(+)|[Mycoplasma><genitalium><G37]
DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37]
and the file saved as:   Myc637000176m.csv 

To get rid of the >< in the common trailing [Mycoplasma><genitalium><G37], everything from the [ onward was cut off with:
$  cut -d"[" -f1 Myc637000176m.csv > Myc637000176m2.csv
resulting in:
DNA><polymerase><III,><beta><subunit><686..1828(+)|
DnaJ><domain><protein><1828..2760(+)|
DNA><gyrase><subunit><B><2845..4797(+)|
DNA><gyrase><subunit><A><4812..7322(+)|
seryl-tRNA><synthetase><7294..8547(+)|

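Next, the internal fields are mostly deleted. The greedy <.*> removes everything from the first < to the last > on each line, so only the first word of the gene name survives: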
sed -e 's/<.*>//g'  Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

The single remaining >< is replaced with the potential separator | :
sed -e 's/></|/g'  Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|
BASICALLY, the clever work is now done, and the rest is more routine manipulation.

A cleanup was done with:
sed -e 's/)|//g'  Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g'  Myc637000176m5.csv > Myc637000176m6.csv
Together these change  (+)|  to  |+  , making the strand a separate field.

The residual  ..  was most easily replaced with the potential separator | as a within-spreadsheet operation on its own field, because there were too many other residual "." characters in the whole file.
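
For comparison, the chain of cut and sed calls might be collapsed into a single sed substitution. The sketch below is an alternative under stated assumptions, not the procedure used above: it assumes every line has the form "name begin..end(strand) [organism]", relies on sed -E (extended regular expressions, accepted by both GNU and BSD sed), and keeps the full gene name instead of only its first word. The function name desc_to_fields is hypothetical.

```shell
# desc_to_fields: turn "name begin..end(strand) [organism]" lines read from
# stdin into "name|begin|end|strand". Lines that do not match the assumed
# layout are passed through unchanged, so they are easy to spot afterwards.
desc_to_fields() {
  sed -E 's/^(.*) ([0-9]+)\.\.([0-9]+)\(([+-])\) \[.*\] *$/\1|\2|\3|\4/'
}
# usage (descriptions.txt is a hypothetical file of infoseq description lines):
#   desc_to_fields < descriptions.txt > fields.csv
```

Because the .. is matched only between two runs of digits, other "." characters in the gene names are left alone.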

After routine manipulations within the spreadsheet,
a view of the overlap detection section is:
F       G       H               I                   J    (fields)
Start   End     nextBegin-End   OR((H2<0),(H1<0))   Stable 0/1 value, for sorting on
686     1828    0               FALSE               0
1828    2760    85              FALSE               0
2845    4797    15              TRUE                0
4812    7322    -28             TRUE                1
7294    8547    4               TRUE                1
8551    9183    -27             TRUE                1
9156    9920    3               FALSE               0
9923    11251   0               FALSE               0

The overlapping genes have the stable value 1 in field J during sorting, while the FALSE/TRUE values in field I are not stable during sorting.
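
The gap and 0/1 columns could also be produced outside the spreadsheet. The following is a minimal sketch, assuming pipe-separated lines of the form name|begin|end|strand that are already sorted by begin; the function name flag_overlaps is hypothetical, and only adjacent genes are compared.

```shell
# flag_overlaps: read "name|begin|end|strand" lines on stdin and append two
# fields: the gap to the next gene (next begin - this end, empty for the last
# gene) and a 0/1 flag that is 1 when the gene overlaps an adjacent gene.
flag_overlaps() {
  awk -F'|' -v OFS='|' '
    { line[NR] = $0; start[NR] = $2; stop[NR] = $3 }
    END {
      for (i = 1; i < NR; i++) {
        gap[i] = start[i + 1] - stop[i]
        # a negative gap marks both members of the overlapping pair
        if (gap[i] < 0) { flag[i] = 1; flag[i + 1] = 1 }
      }
      for (i = 1; i <= NR; i++)
        print line[i], (i < NR ? gap[i] : ""), flag[i] + 0
    }'
}
# usage (fields.csv is a hypothetical file in the assumed layout):
#   flag_overlaps < fields.csv
```

The flag is written out as plain text, so it stays fixed under any later sorting. Note that this sketch marks both genes of every overlapping pair, which need not match the spreadsheet columns row for row.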