Genes and proteins from the following eleven organisms can be searched using gene names or systematic identifiers from the corresponding Model Organism Database (MOD), Ensembl, NCBI or UniProt: H. sapiens, M. musculus, R. norvegicus, A. thaliana, C. elegans, D. discoideum, D. melanogaster, P. falciparum, E. coli, S. pombe, and S. cerevisiae. It is possible to search with identifiers and names from MGI for mouse, RGD for rat, TAIR for Arabidopsis, FlyBase for fly, WormBase for worm, dictyBase for Dictyostelium, GeneDB for Plasmodium, EcoGene for E. coli, GeneDB for S. pombe, and SGD for S. cerevisiae. For mouse MGI identifiers, please prefix the number with "MGI:", and for rat RGD identifiers, prefix the number with "RGD:". For human, mouse and rat, it is also possible to search with Ensembl protein IDs. Some other identifiers can also be used, including UniProt accession numbers and NCBI protein GI identifiers. For NCBI GI identifiers, please prefix the number with "GI:". However, some of the searches may be limited by the quality of links to these databases from the relevant model organism database. GO information comes from the Gene Ontology home page.
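The prefix rules above can be illustrated with a small helper. This is only a sketch for preparing query strings before pasting them into the search form; the function name, the dictionary keys and the example numbers are hypothetical and are not part of YOGY itself.

```python
# Prefixes as stated in the text; the keys are hypothetical labels.
PREFIXES = {"mgi": "MGI:", "rgd": "RGD:", "gi": "GI:"}


def with_prefix(source, number):
    """Return a numeric identifier carrying the prefix expected for its source."""
    prefix = PREFIXES[source.lower()]
    text = str(number).strip()
    # Strip an existing prefix so it is never added twice.
    if text.upper().startswith(prefix):
        text = text[len(prefix):]
    if not text.isdigit():
        raise ValueError("expected a numeric identifier, got %r" % number)
    return prefix + text


# Placeholder numbers, not real database entries.
print(with_prefix("mgi", 12345))
print(with_prefix("gi", "GI:67890"))
```

Identifiers from the other databases listed above are entered as they appear in the source database, so the helper deliberately covers only the three prefixed cases.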

YOGY is implemented in a MySQL relational database running on a UNIX server. Data for the external resources have been downloaded from the associated FTP sites and websites for import into this database. The data model has been validated to identify and remove potential problems such as many-to-many relationships. File import is handled by Perl scripts together with the Perl DBI module. For queries, we have designed a web interface using the Perl CGI module, hosted on an Apache server. The Perl GD graphics module is used for bar charts.
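One common way to remove a many-to-many relationship from a relational model is to introduce a link table, so that each side has only one-to-many relationships. The sketch below shows that general technique using Python's built-in sqlite3 module in place of MySQL and Perl DBI. The table names, column names and rows are all hypothetical and do not reflect the actual YOGY schema.

```python
import sqlite3

# In-memory database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE cluster (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- Link table: one row per (protein, cluster) membership.
CREATE TABLE protein_cluster (
    protein_id INTEGER NOT NULL REFERENCES protein(id),
    cluster_id INTEGER NOT NULL REFERENCES cluster(id),
    PRIMARY KEY (protein_id, cluster_id)
);
""")

# Placeholder rows, not real data.
conn.executemany("INSERT INTO protein (id, name) VALUES (?, ?)",
                 [(1, "protA"), (2, "protB")])
conn.executemany("INSERT INTO cluster (id, name) VALUES (?, ?)",
                 [(1, "clusterX"), (2, "clusterY")])
conn.executemany("INSERT INTO protein_cluster VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])


def clusters_for(connection, protein_name):
    """Return the names of all clusters that contain the given protein."""
    rows = connection.execute(
        "SELECT c.name FROM cluster c "
        "JOIN protein_cluster pc ON pc.cluster_id = c.id "
        "JOIN protein p ON p.id = pc.protein_id "
        "WHERE p.name = ? ORDER BY c.name",
        (protein_name,),
    )
    return [row[0] for row in rows]


print(clusters_for(conn, "protA"))
```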

Genes and proteins from the organisms listed above can be searched using gene names or systematic identifiers reported in the databases indicated above. Where possible, identifiers from the same MOD are shown throughout the output so that proteins from the five data sources can be checked for consistency; this is useful as the different homology resources use identifiers from a variety of databases. Because of the ambiguity of many identifiers, legacy naming systems and revisions to gene structures and gene complements, it is not always possible to be certain whether some apparent differences in orthology calls are, in fact, equivalent proteins. Whilst we have made every effort to map these identifiers automatically using resources from the MOD, the International Protein Index, UniProt, and the NCBI Gene database, any discrepancies should be checked manually by the user. It is possible to use incomplete names with a wild-card option, which returns a list of matching genes and one-line descriptions for further search.
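The behaviour of a wild-card search can be sketched as follows. The text does not state which wild-card character YOGY accepts or how matching is implemented, so this example assumes "*" and uses Python's fnmatch module. The gene table and its descriptions are placeholders.

```python
from fnmatch import fnmatchcase

# Hypothetical lookup table: gene name -> one-line description (placeholders).
GENES = {
    "cdc2": "example description A",
    "cdc25": "example description B",
    "wee1": "example description C",
}


def wildcard_search(pattern, genes=GENES):
    """Return (name, description) pairs whose name matches the pattern.

    Matching is case-insensitive; '*' is assumed to match any run of characters.
    """
    wanted = pattern.lower()
    return sorted(
        (name, description)
        for name, description in genes.items()
        if fnmatchcase(name.lower(), wanted)
    )


for name, description in wildcard_search("cdc*"):
    print(name, "-", description)
```

Each returned name could then be submitted as a complete query in a second search.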

GO terms annotated to the identified orthologs can also be retrieved. Only associations using experimental and curator-validated evidence codes are included. The option to show GO terms is switched off by default because of the additional time required to download GO data. Options are provided to display GO terms in separate tables at the end of each resource, or in a single table at the end of the output.
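Filtering by evidence code can be sketched as below. The exact set of codes that YOGY accepts is not listed here, so the example makes one assumption: only IEA (inferred from electronic annotation), the GO evidence code for annotations not reviewed by a curator, is excluded. The protein names and associations are placeholders.

```python
# Assumption: IEA is the only code treated as not curator-validated.
EXCLUDED_CODES = {"IEA"}


def curated_annotations(annotations):
    """Keep (protein, go_id, evidence_code) tuples with an accepted evidence code."""
    return [a for a in annotations if a[2] not in EXCLUDED_CODES]


# Placeholder associations for illustration only.
example = [
    ("protA", "GO:0007049", "IDA"),
    ("protA", "GO:0005634", "IEA"),
    ("protB", "GO:0005634", "TAS"),
]

for protein, go_id, code in curated_annotations(example):
    print(protein, go_id, code)
```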

The output is provided in tabulated HTML format. The first table contains general information for the protein of interest including description and links to the corresponding MOD and the UniProt database, if this accession number is available. For S. pombe, links to gene expression profiles during the cell cycle (C), meiotic differentiation (M), and stress conditions (S) are also provided. The data sources which provide positive orthology results for the gene of interest are then specified with links to the corresponding outputs.

The orthology results are presented in a standard output format for each dataset. At the top is information about the query protein cluster(s), followed by a list of available orthologs ordered by organism together with links to the original databases. Links to UniProt are also provided if the accession number is available. Below, each data source is mentioned in the order given in the output page.

For KOGs, the summary table starts with the unique KOG name together with a link to the website. The next column displays a bar chart of the ortholog numbers for each organism, revealing the phylogenetic pattern for the KOGs. This chart also provides a link to a list of other KOGs that share the same phylogenetic pattern, which may help to understand how the protein has been conserved through evolution. The summary table also indicates the functional classification, with a link to other KOGs in this classification, and a one-line description for the KOG. The orthologs are displayed in a list below the summary table, together with links for each protein or domain to the corresponding KOG cluster alignments and to the relevant protein page at NCBI.
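The idea of grouping KOGs that share a phylogenetic pattern can be sketched as follows. This example assumes that a pattern means presence or absence of at least one ortholog in each organism; the text does not say whether YOGY also compares the counts. The KOG names, the organism subset and the counts are placeholders.

```python
from collections import defaultdict

# Illustrative subset of organisms, in a fixed order (assumption).
ORGANISMS = ("A. thaliana", "C. elegans", "S. pombe")


def pattern(counts):
    """Return a presence/absence tuple, one entry per organism in ORGANISMS."""
    return tuple(counts.get(organism, 0) > 0 for organism in ORGANISMS)


def group_by_pattern(kog_counts):
    """Map each phylogenetic pattern to the list of KOGs that show it."""
    groups = defaultdict(list)
    for kog, counts in kog_counts.items():
        groups[pattern(counts)].append(kog)
    return dict(groups)


# Placeholder ortholog counts per organism.
example = {
    "KOG_A": {"A. thaliana": 2, "C. elegans": 1, "S. pombe": 1},
    "KOG_B": {"A. thaliana": 5, "C. elegans": 3, "S. pombe": 2},
    "KOG_C": {"S. pombe": 1},
}

for pat, kogs in group_by_pattern(example).items():
    print(pat, kogs)
```

Here KOG_A and KOG_B fall into the same group because both have orthologs in all three organisms, even though their counts differ.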

For Inparanoid, we excluded orthologs from largely unannotated organisms, which are not in the other homology resources; this reduces the output page to 18 organisms (20 databases, as mouse and rat each include two datasets). The bar chart at the top shows the phylogenetic pattern for the orthologs. The list underneath shows the orthologs for the query protein, links to the Inparanoid protein clusters for each organism, the Inparanoid score, and a link to the protein page in the corresponding MOD. Inparanoid uses a sophisticated methodology to distinguish between in- and out-paralogs; we have downloaded the tables from the Inparanoid website and present these pre-calculated datasets on the YOGY website.

For Homologene, the summary at the top provides a link to the query protein cluster at NCBI. Each ortholog is then presented by organism with links to the relevant NCBI pages.

For OrthoMCL, we have again excluded orthologs from largely unannotated organisms and prokaryotes (except Escherichia coli, which is also included in Inparanoid), reducing the output to 24 organisms. The summary table includes a link to the OrthoMCL cluster and a phylogenetic bar chart. This table is followed by a list of orthologs in the cluster with a link to the original protein sequence used for clustering and a link to the relevant MOD. For some of the less well characterised yeasts, which have no MOD, a link is provided to either the "Yeast Gene Order Browser" or Génolevures, both of which provide graphical representations of conserved genome location.

For the curated yeast ortholog dataset, only fission and budding yeast proteins are included. The output provides the lists of orthologs together with links to the S. pombe GeneDB and SGD databases.

To cite YOGY and for further details see:

Penkett CJ, Morris JA, Wood V, and Bähler J (2006). YOGY: a web-based, integrated database to retrieve protein orthologs and associated Gene Ontology terms.
Nucleic Acids Res. 34: W330-334.

A reprint is available as a PDF.