The origins of genetics date back to the 19th-century scientist and Augustinian friar Gregor Mendel, who studied how traits are handed down from parents to offspring. Mendel is widely known for the Mendelian inheritance model (Mendel's laws), which was controversial at first and is now foundational to genetics. He derived the model from cross-pollinating pea plants. In the second half of the 20th century, with advances in technology, the discovery of the double helix (the structure of DNA), and the sequencing of the human genome, genetics flourished and gave rise to a number of subfields, such as epigenetics and population genetics. Modern genetics has expanded beyond the study of inheritance to the study of the function and behavior of genes.
The human genome (the complete set of genes or genetic material present in a cell or organism) consists of about 3 billion DNA base pairs and contains both protein-coding genes and noncoding DNA. Estimates suggest that it holds between 20 and 25 thousand protein-coding genes, whose coding sequences account for roughly 1.5 to 2% of the genome. The remaining 98 to 98.5% consists of genes for non-coding RNA molecules, regulatory DNA sequences, LINEs, SINEs, introns, and sequences with unknown function.
Building on these protein-coding genes, medical genetics studies the correlation between genetic variation and human disease, and pharmacogenetics studies how genotype can affect drug response. Our understanding of disease has grown rapidly, and many conditions that have plagued humans for centuries are now linked to genetic origins. Evidence indicates that the majority of diseases cannot be attributed to a single gene but arise from complex interactions between multiple genes.
Visualizing Gene-Disease association, a research project in information design, aims to apply and interweave advances in data engineering, human factors, and data visualization with science. Genetics data usually lives in databases that are rightfully aimed at geneticists, and it is often accessed through massively complex interfaces or direct database queries.
A complex dataset such as this one presents many of the challenges that remain only partially solved, or unresolved, in information architecture, design, and visualization as applied to both science and business. The data visualization research question in this alpha iteration is how to resolve the different degrees of complexity found in the dataset without subordinating simpler cases to an interface optimized for complexity. This meant equalizing the experience, via a zooming capability, for both the one-to-one case (Milroy’s disease) and the case of one disease with close to 500 associations (breast neoplasms, prostatic neoplasms, autistic disorders).
The question from an interface and information architecture standpoint is how to design a system that provides much-needed context without cluttering the interface to the point of “choice paralysis.” Multiple contextual layers are provided and respond to user interaction. The interface also responds to the user’s intent or current focus, as captured by their position in the browser, by attaching and releasing the filters needed for interacting with what is visible on the screen. Finally, the concept of a “research path” was introduced. The Research Path preserves the user’s selections and visualizes the steps they take in interacting with the visualizations. This approach holds value because it lets users retrace their steps and visualize their hypothesis or train of thought.
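The Research Path idea can be illustrated with a small data structure. The following is only a minimal sketch, not the project's implementation: the class names `Step` and `ResearchPath`, the `kind` labels, and the placeholder values are all hypothetical, and it assumes that retracing to an earlier step discards the steps taken after it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    """One user selection (hypothetical shape, for illustration only)."""
    kind: str   # e.g. "gene" or "disease"
    value: str  # the selected item's label


@dataclass
class ResearchPath:
    """Ordered record of the user's selections."""
    steps: List[Step] = field(default_factory=list)

    def select(self, kind: str, value: str) -> None:
        # Preserve each selection in the order it was made.
        self.steps.append(Step(kind, value))

    def retrace(self, index: int) -> Step:
        # Go back to an earlier step; assumption: later steps are dropped.
        if not 0 <= index < len(self.steps):
            raise IndexError("no such step in the research path")
        del self.steps[index + 1:]
        return self.steps[index]

    def trail(self) -> str:
        # Text rendering of the path, a stand-in for the visual one.
        return " > ".join(f"{s.kind}:{s.value}" for s in self.steps)


path = ResearchPath()
path.select("disease", "disease_1")  # placeholder labels, not real data
path.select("gene", "GENE_A")
path.select("disease", "disease_2")
path.retrace(1)
print(path.trail())
```

Keeping the path as a plain ordered list makes both operations described above cheap: rendering the trail is a single pass, and retracing is a truncation.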
Multiple questions and types of visualization remain to be pursued, and improvements to the current interface will be developed in a beta version containing a larger subset of gene-disease associations. The goal for the final outcome is a resource that can 1) give researchers and scientists from fields other than genetics and its closely related disciplines access to this data via a uniform interface and intuitive visualization; 2) provide the general public with an augmented understanding of the origin of disease; and 3) in the long term, facilitate the work of geneticists.
Visualizing Gene-Disease association uses curated gene-disease associations from UNIPROT and CTD (human subset), developed by DisGeNET, to visualize associations between diseases and genes. The DisGeNET database integrates human gene-disease associations from various expert-curated databases and text-mining-derived associations, covering Mendelian, complex, and environmental diseases (Piñero et al., 2015; Bauer-Mehren et al., 2011). The integration is performed by means of gene and disease vocabulary mapping and by using the DisGeNET association type ontology. For detailed information on the methodology, see the original publications: Piñero et al., 2015; Bauer-Mehren et al., 2011; and Bauer-Mehren et al., 2010.
The database for this visualization contains 6753 unique diseases and 7878 unique genes. The genes in the dropdown menu are organized in descending order by their number of associations with diseases (the more associations a gene has, the higher it appears in the list). The gene symbol is from the HUGO Gene Nomenclature Committee (HGNC) database. The gene's official full name is from the National Center for Biotechnology Information (NCBI), and genes are also identified by their UniProt accession number. The diseases are organized by their Unified Medical Language System (UMLS) semantic type and classified according to the Medical Subject Headings (MeSH) vocabulary.
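The dropdown ordering described above can be sketched as a count-and-sort over gene-disease pairs. This is an illustrative sketch under stated assumptions, not the project's code: the function name `rank_genes_by_association_count`, the `(gene, disease)` tuple format, the alphabetical tie-break, and the placeholder labels are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_genes_by_association_count(
    associations: Iterable[Tuple[str, str]],
) -> List[Tuple[str, int]]:
    """Return (gene, number of distinct diseases), most associated first."""
    diseases_by_gene = defaultdict(set)
    for gene, disease in associations:
        # A set ignores duplicate rows for the same gene-disease pair.
        diseases_by_gene[gene].add(disease)
    counts = [(gene, len(ds)) for gene, ds in diseases_by_gene.items()]
    # Descending by count; ties broken alphabetically (an assumption).
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Placeholder pairs for illustration only, not taken from DisGeNET.
sample = [
    ("GENE_B", "disease_1"),
    ("GENE_A", "disease_1"),
    ("GENE_A", "disease_2"),
    ("GENE_C", "disease_3"),
    ("GENE_A", "disease_2"),  # duplicate row, counted once
]
for gene, count in rank_genes_by_association_count(sample):
    print(gene, count)
```

Counting distinct diseases rather than rows keeps the ranking stable if the same association is reported by more than one source.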
For detailed information on the procedures used to calculate association scores and on the association type ontology, refer to DisGeNET DB Info, sections Association Score and Association Type Ontology.