Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences.

PubWeight™: 0.75

Article: PMC 2789693

Published in Bioinform Biol Insights on November 24, 2009

Authors

Derek Gatherer (1)

Author Affiliations

1: MRC Virology Unit, Institute of Virology, Church Street, Glasgow G11 5JR, UK.

Articles cited by this

EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet (2000) 69.26

The Bioperl toolkit: Perl modules for the life sciences. Genome Res (2002) 58.63

PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 45.07

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006) 43.68

Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res (2003) 38.75

Pfam: clans, web tools and services. Nucleic Acids Res (2006) 34.83

BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett (1999) 25.40

ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res (2006) 6.13

Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol (1997) 4.85

Comparative DNA analysis across diverse genomes. Annu Rev Genet (1998) 3.67

Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol (2001) 3.64

Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res (1994) 3.41

Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol (1998) 3.09

NRL-3D: a sequence-structure database derived from the protein data bank (PDB) and searchable within the PIR environment. Protein Seq Data Anal (1990) 2.92

Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci U S A (1999) 2.38

Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci U S A (1997) 2.35

Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis. Nucleic Acids Res (1987) 2.11

The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis. Nucleic Acids Res (1987) 2.02

Gene structure prediction by linguistic methods. Genomics (1994) 1.81

Multiple hypothesis testing to detect lineages under positive selection that affects only a few sites. Mol Biol Evol (2007) 1.77

Linguistic features of noncoding DNA sequences. Phys Rev Lett (1994) 1.75

Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn (1986) 1.63

The language of genes. Nature (2002) 1.59

Grammatical inference in bioinformatics. IEEE Trans Pattern Anal Mach Intell (2005) 1.52

Statistical evaluation and biological interpretation of non-random abundance in the E. coli K-12 genome of tetra- and pentanucleotide sequences related to VSP DNA mismatch repair. Nucleic Acids Res (1992) 1.50

Identification of human gene functional regions based on oligonucleotide composition. Proc Int Conf Intell Syst Mol Biol (1993) 1.43

Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons. Nucleic Acids Res (1998) 1.39

WordSpy: identifying transcription factor binding motifs by building a dictionary and learning a grammar. Nucleic Acids Res (2005) 1.34

Monotony of surprise and large-scale quest for unusual words. J Comput Biol (2003) 1.06

Grammatical model of the regulation of gene expression. Proc Natl Acad Sci U S A (1992) 1.03

The linguistics of DNA: words, sentences, grammar, phonetics, and semantics. Ann N Y Acad Sci (1999) 1.03

Phylogenetics of artificial manuscripts. J Theor Biol (2004) 1.02

Protein linguistics - a grammar for modular protein assembly? Nat Rev Mol Cell Biol (2006) 0.98

A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization. Comput Appl Biosci (1993) 0.93

Noncoding DNA, Zipf's law, and language. Science (1995) 0.90

Homonyms, synonyms and mutations of the sequence/structure vocabulary. J Mol Biol (1984) 0.90

A syntactic representation of units of genetic information--a syntax of units of genetic information. J Theor Biol (1991) 0.89

Oligonucleotide frequencies in DNA follow a Yule distribution. Comput Chem (1996) 0.89

The generative grammar of the immune system. Science (1985) 0.87

Combinatorial motif analysis and hypothesis generation on a genomic scale. Bioinformatics (2000) 0.86

Intervening sequences exhibit distinct vocabulary. J Biomol Struct Dyn (1986) 0.86

Pentamer vocabularies characterizing introns and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melanogaster. Gene (2003) 0.85

Entity grammar systems: a grammatical tool for studying the hierarchical structures of biological systems. Bull Math Biol (2004) 0.84

Similarity in oligonucleotide usage in introns and intergenic regions contributes to long-range correlation in the Caenorhabditis elegans genome. Gene (1999) 0.84

Towards a unified grammatical model of sigma 70 and sigma 54 bacterial promoters. Biochimie (1996) 0.83

The modular structure of informational sequences. Biosystems (1996) 0.82

Search for ancient patterns in protein sequences. J Mol Evol (1996) 0.82

Is DNA a language? J Theor Biol (1997) 0.82

Word organization in coding DNA: a mathematical model. Theory Biosci (2006) 0.81

The prediction of human exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Proc Int Conf Intell Syst Mol Biol (1994) 0.80

Lack of biological significance in the 'linguistic features' of noncoding DNA--a quantitative analysis. Nucleic Acids Res (1996) 0.79

Deciphering the language of the genome. J Theor Biol (1997) 0.79

Preface to a grammar of biology. A hundred years of nucleic acid research. Science (1971) 0.79

Are grammatical representations useful for learning from biological sequence data?--a case study. J Comput Biol (2001) 0.78

DNA sequence analysis linguistic tools: contrast vocabularies, compositional spectra and linguistic complexity. Appl Bioinformatics (2003) 0.78

A grammar describing 'biological binding operators' to model gene regulation. Biochimie (1996) 0.77

Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Proc IEEE Comput Syst Bioinform Conf (2004) 0.77

A language to describe the growth of neurites. Biol Cybern (1993) 0.77

Kinetics: the grammar of enzymology. FEBS Lett (1976) 0.77

An improved method for detection of words with unusual occurrence frequency in nucleotide sequences. J Theor Biol (1993) 0.76

Hexanucleotide frequency database. Comput Appl Biosci (1997) 0.76

A large-scale comparison of genomic sequences: one promising approach. Acta Biotheor (2003) 0.76

A study of oligonucleotide occurrence distributions in DNA coding segments. J Theor Biol (1997) 0.76