#This is outdated

The Ortholog Database Connector gives a comprehensive view of the multi-organism orthologs of a given human gene, as produced by diverse ortholog-sorting algorithms. For example, the table for the human gene P00395 has one column per database, each listing the members of the ortholog group to which that database assigns P00395.
Orthologous genes are genes that arise from speciation, for example human Talin and mouse Talin. Accurate identification of orthologs is key for comparative genomic approaches, but there is no consensus model. Approximately 30 independent algorithms exist to classify orthologous genes; some are based on species and gene phylogeny, and some use sequence comparison exclusively. Phylogeny-based approaches have the advantage of tracking genes through realistic evolutionary paths, but they tend to be computationally intensive and are subject to error from misconstructed gene trees. BLAST-based approaches are much faster, generally using an all-by-all BLAST of multispecies proteomes to generate best matches, but they lack the added power of phylogenetic information.
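As a rough illustration of the "best matches" idea, the sketch below pairs proteins from two proteomes by reciprocal best hit. Reciprocal best hit is only one common way to turn all-by-all scores into ortholog candidates, and it is an assumption here rather than the method of any particular database. The function names, protein names, and scores are all hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, for example
    parsed from a tabular similarity-search report (hypothetical input).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores for two tiny proteomes (placeholders, not real data).
a_vs_b = [
    ("humanA", "mouseA", 950.0),
    ("humanA", "mouseB", 310.0),
    ("humanB", "mouseB", 800.0),
    ("humanC", "mouseA", 400.0),
]
b_vs_a = [
    ("mouseA", "humanA", 940.0),
    ("mouseB", "humanB", 790.0),
    ("mouseB", "humanA", 300.0),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input, `humanC` finds `mouseA` as its best match, but `mouseA` prefers `humanA`, so `humanC` is left unpaired. A phylogeny-based method could potentially resolve a case like this by placing all three sequences in a gene tree.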
There is often no clear answer as to which database to use to generate lists of orthologs for a comparative project. Compounding this problem is the lack of a method to easily compare results from different databases. Additionally, the precomputed data available from the databases use various gene naming schemes, or different releases of the same scheme with non-overlapping gene names. The Orthology Database Connector attempts to inform research using orthologs by standardizing and aligning ortholog groups from ten separate ortholog grouping algorithms. In addition, the number of databases that call a gene an ortholog may be used as a proxy for confidence in that gene’s orthology to the reference gene. The ability to visualize and compare ortholog groupings from multiple sources will aid comparative study of proteins and genes.
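The confidence proxy described above can be sketched as a simple count: for each candidate gene, how many databases place it in the same group as the reference gene. This is a minimal illustration, not the connector's actual implementation. The function name `ortholog_support`, the database labels, and the gene identifiers are all hypothetical placeholders.

```python
from collections import Counter


def ortholog_support(groups_by_database, reference):
    """Count how many databases group each gene with the reference gene.

    `groups_by_database` maps a database name to the set of standardized
    gene identifiers in the group that database reports (assumed layout).
    """
    support = Counter()
    for members in groups_by_database.values():
        if reference not in members:
            continue  # this database reports no group containing the reference
        support.update(set(members) - {reference})
    return support


# Placeholder groups; "REF" stands in for the human reference gene.
example = {
    "db_one": {"REF", "A", "B"},
    "db_two": {"REF", "A"},
    "db_three": {"REF", "A", "C"},
    "db_four": {"X", "Y"},  # unrelated group, ignored
}

support = ortholog_support(example, "REF")
for gene, count in support.most_common():
    print(gene, count)
```

Here gene `A` is supported by three databases while `B` and `C` are each supported by one, so `A` would be treated as the higher-confidence ortholog. The count is only meaningful once identifiers have been standardized across databases.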
The cross-referencing of UniProt with phylogenomic databases facilitated this project. UniProt entries can be filtered by presence in an ortholog database and downloaded along with their database-assigned group ID. This approach was used to retrieve the eggNOG, GeneTree, HOGENOM, HOVERGEN, KO, OMA, OrthoDB, and TreeFam databases. The entire 2014 release of the COG database was downloaded directly from the COG website, and the given RefSeq IDs were converted to UniProt accessions using the downloadable UniProt ID mapping table. The Human PhylomeDB was downloaded from the PhylomeDB website, formatted with UniProt accessions. For both the COG and PhylomeDB databases it was necessary to convert deprecated accessions to their current versions and to map them to up-to-date UniProt gene names.
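The RefSeq-to-UniProt conversion step might look roughly like the sketch below. It assumes the mapping table is the three-column, tab-separated form (UniProt accession, ID type, external ID) and that RefSeq rows carry the ID type `RefSeq`; both should be checked against the file actually downloaded. The function names and all identifiers in the sample are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict


def load_external_to_uniprot(handle, id_type="RefSeq"):
    """Build {external ID -> set of UniProt accessions} from a mapping table.

    Assumes tab-separated rows of (accession, ID type, external ID);
    rows of any other shape are skipped.
    """
    mapping = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue
        accession, row_type, external_id = row
        if row_type == id_type:
            mapping[external_id].add(accession)
    return mapping


def convert_ids(external_ids, mapping):
    """Split IDs into those with UniProt accessions and those without."""
    converted, unmapped = {}, []
    for external_id in external_ids:
        if external_id in mapping:
            converted[external_id] = sorted(mapping[external_id])
        else:
            unmapped.append(external_id)
    return converted, unmapped


# In-memory stand-in for the downloaded table (placeholder identifiers).
sample = io.StringIO(
    "ACC_1\tRefSeq\tREFSEQ_A.1\n"
    "ACC_1\tGeneID\t111\n"
    "ACC_2\tRefSeq\tREFSEQ_A.1\n"
    "ACC_3\tRefSeq\tREFSEQ_B.1\n"
)

mapping = load_external_to_uniprot(sample)
converted, unmapped = convert_ids(["REFSEQ_A.1", "REFSEQ_C.1"], mapping)
print(converted)
print(unmapped)
```

Keeping the unmapped identifiers in a separate list makes it easier to spot entries that need manual follow-up, such as deprecated accessions. One external ID can map to several accessions, which is why the values are sets rather than single strings.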
- eggNOG clusters protein sequences using the FASTA algorithm to derive maximum likelihood gene family trees
- Version 4.0
- 2031 species, 7.7 million proteins, 1.7 million ortholog groups
- "Protein trees are constructed using a representative protein for every gene in Ensembl: proteins are clustered using hcluster_sg based on NCBI BLAST+ e-values, and each cluster of proteins is aligned using M-Coffee or Mafft. Finally, TreeBeST is used to produce a gene tree from each multiple alignment, reconciling it with the species tree to call duplication events."
- Ensembl release 79, March 2015
- 69 eukaryotic species

COGs - Phylogenetic classification of proteins encoded in complete genomes
- COG uses pairwise sequence comparisons and best-hit analysis to group proteins into clusters, with a focus on functional category prediction and annotation
- 2014 Update
- Focus on bacteria and archaea, 711 species
- HOGENOM uses all-by-all BLASTP2 followed by clustering, multiple sequence alignment, and phylogenetic tree analysis
- Release 06, December 2011
- 1233 bacteria, 97 archaea, 140 eukaryotes
- HOVERGEN uses all-by-all BLASTP2 followed by clustering, multiple sequence alignment, and phylogenetic tree analysis
- Release 49, December 2009
- All vertebrate protein sequences from the UniProt Knowledgebase
- "The KO system is a collection of manually defined ortholog groups (KO entries), which are categorized under the hierarchy of KEGG pathways and BRITE ontologies".
- KEGG release 74, April 2015
- 304 eukaryotes, 3504 prokaryotes
- OMA uses an all-by-all Smith-Waterman alignment to find mutually closest homologs based on evolutionary distance. Orthologs are then clustered either (1) pairwise, to identify OMA ortholog cliques, or (2) hierarchically, to identify HOGs (Hierarchical Ortholog Groups)
- September 2014 release
- 1706 species (eukaryote, bacteria, archaea)
- OrthoDB clusters based on best-reciprocal-hits from an all-by-all Smith-Waterman algorithm, and then references clusters to species phylogenies.
- Version 8
- 52 vertebrates, 45 arthropods, 142 fungi, 13 basal metazoans, and 1115 bacteria
- "PhylomeDB provides genome-wide orthology and paralogy predictions which are based on the analysis of the phylogenetic trees. The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood tree inference, alignment trimming and evolutionary model testing."
- Human Phylome version 3, 2011, PhylomeDB version 4
- 39 species proteomes referenced to human proteome
- "TreeFam is a database composed of phylogenetic trees inferred from animal genomes. It provides orthology/paralogy predictions as well the evolutionary history of genes."
- Release 9, March 2013
- 109 species, 15,736 gene families