In early 2014, I joined the University of São Paulo determined to become a scientist. I got a degree in Biomedical Sciences, in a course strongly oriented toward the academic world. During my undergraduate years, at a rate of more than one internship per semester, I set foot in and learned from 9 different labs in 4 countries: Brazil, Israel, Switzerland, and Austria. I also took part in competitions in Boston, USA, going to BIOMOD and iGEM a couple of times.
And yet, before enrolling in a Bioinformatics graduate program, I had never properly heard about ontologies. This leads me to infer that the average biomedical scientist likely knows very little about ontologies. And yet, we are all ontologists!
By ontology, I mean computational ontologies, the kind presented to students as “structured vocabularies,” a very dull description. They should be presented as a highly technical toolbox to make computers know things. Philosophical ontology (the study of the nature of being) is related and fascinating, but not quite in scope here.
In any case, when someone in the life sciences hears the word ontology, any combination of three things happens:
- They mistake it for ontogeny, or how organisms come to be
- They immediately think of the Gene Ontology project and gene sets, and using those sets in data analysis
- They think about structured vocabularies and cataloging. In other words, boring stuff.
Why do I say that? Well, two years ago, my idea of ontology was some combination of the three things above. I am going to walk you through how everything changed. And for that, I will make a little digression into the past.
On a Brazilian summer day, back in January 2019, I started writing a scientific manuscript with my advisor. I got a piece of advice: first, get the ideas you want to share, and then work on sculpting the words. I went to the laboratory balcony, facing the São Paulo sunset, and started to think about it.
The best articles are indeed pieces of poetry; they are pleasing to read and convey complex ideas naturally. But in all cases, we have to deconvolve the work of the writers. The concepts in their minds become large chunks of text, and then we try to get the same concepts out of the large pieces of text. Scientific writing is a kind of (beautiful) cryptography.
However, we do not always have the cryptography keys. Sometimes we can’t get the concepts out because the writing was not excellent. Or because the report is ambiguous. Or because we are somehow biased. Even in ideal conditions, this decoding process is inherently lossy, which makes science harder than it needs to be.
At a high level of writing, there seems to be a trade-off between precision and clarity. If we describe precisely how we sorted our astrocytes in the abstract, it won’t be readable. And yet, differently sorted astrocytes may behave differently. Why should we rely only on plain text, then? Why don’t we use some rigorous way of representing the ideas?
Well, at least some researchers do. The formal representation of concepts is at the core of maths. The success of maths in changing our world can be attributed to formalization, and maths has used formal, ontology-like conceptualizations for a very long time. I mean, we have had numbers for millennia. The concept of “zero” is not trivial per se; it is a representation of a complex idea. And so it goes for basic maths concepts, like “division” or “exponentiation.” Mathematical problems are rigorously represented, and may go unsolved for centuries without changing their meaning. I cannot think of any natural language writing that would retain meaning throughout centuries as well as maths.
Basic calculators would not work without the formalization of mathematical concepts. Together, all mathematicians in the world, using pen and paper, could not solve the calculations a phone does in a single day. Explicit, formal ontology in maths has led to the development of the powerful, god-like machines that we call computers.
And how do we write biology in equations, then? Can biology reach the same level of rigor as mathematics, and lead us to a whole new era of biological advances?
It was already night in São Paulo, and I had to leave the balcony by that time. I eventually stopped procrastinating and came back to writing the manuscript. Still, the questions raised at that moment followed me throughout the years.
I started looking for ways in which we biologists could represent this kind of thing. I started reading about formal logic, about modeling of biological systems with programming, about anything, really, that could be of use. I ended up reading about knowledge graphs and how Google uses the Google Knowledge Graph to power the world’s most powerful search engine.
Google takes concepts and links them, using formal representations that are familiar to the human mind. A concept like “Jimi Hendrix” is related by formal associations to certain values. If I google “Jimi Hendrix birthday”, for example, I immediately get a result: November 27, 1942. That date is linked to “Jimi Hendrix” via a property of “date of birth.” The same goes for the date of death, the place of birth, and so on.
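To make the idea of "a concept linked to a value via a property" concrete, here is a minimal sketch in Python. It is a toy of my own, not how Google actually stores anything: the `TRIPLES` list and the `lookup` function are hypothetical names, and the three facts are typed in by hand.

```python
from datetime import date

# Each fact is a (subject, property, value) triple, the basic unit of a
# knowledge graph. This hand-written list is a toy illustration only; it is
# NOT Google's actual data model.
TRIPLES = [
    ("Jimi Hendrix", "date of birth", date(1942, 11, 27)),
    ("Jimi Hendrix", "place of birth", "Seattle"),
    ("Jimi Hendrix", "date of death", date(1970, 9, 18)),
]


def lookup(subject, prop, triples=TRIPLES):
    """Return every value linked to `subject` through the property `prop`."""
    return [value for s, p, value in triples if s == subject and p == prop]


# "Jimi Hendrix birthday" becomes a lookup on the "date of birth" property.
print(lookup("Jimi Hendrix", "date of birth"))
```

The point of the sketch is that the question is answered by following a named link, not by reading a sentence. A real knowledge graph does the same thing with unambiguous identifiers instead of plain strings.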
Google can use its knowledge graph for complex things. Search for “length of a whale,” and the knowledge graph will provide you with the average length for many whale species. Biologists studied that for centuries, published results, and these results have now become knowledge. Whale aficionados know those numbers, and now the Google Knowledge Graph knows them too. And if Google knows them, then I know them too, with a delay of some seconds!
But who told Google? For the whale length, it was the Encyclopedia of Life database. For Jimi Hendrix, it is unclear, but it seems to be Wikipedia. How is that possible? Can Google read natural language texts on Wikipedia? Well, probably yes, as there are projects, such as DBpedia, that infer knowledge graphs from Wikipedia. But that is not the only way; there is one other answer that is much more interesting: Wikidata.
Wikidata is like Wikipedia, but for concepts and their relations. The Google Knowledge Graph uses it, and Wikipedia uses it, so you use it too. Wikidata is a public knowledge graph that anyone can edit. It is a fast track between minds that allows representation of complex concepts and lots of knowledge. With a good Wikidata search, we can get a list of football players that got married to singers within a couple of minutes (do click on the link). And we are sure that it means the football played with round balls, and not any other sport. Wikidata does for concepts what calculators do for numbers, enabling miraculous, god-like searches.
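For the curious, a search like the football one can be written as a SPARQL query. The sketch below builds such a query as a Python string; it is my own reconstruction, not necessarily the query behind the original link. The identifiers are the ones I understand Wikidata to use (P106 for occupation, P26 for spouse, Q937857 for association football player, Q177220 for singer), so please double-check them on Wikidata before relying on them. The `build_query` function name is hypothetical.

```python
# Wikidata identifiers, as I understand them (assumptions: verify on Wikidata).
OCCUPATION = "P106"
SPOUSE = "P26"
FOOTBALL_PLAYER = "Q937857"  # association football, the one with round balls
SINGER = "Q177220"


def build_query(limit=20):
    """Build a SPARQL query for football players whose spouse is a singer."""
    # Doubled braces are literal braces inside an f-string.
    return f"""
SELECT ?player ?playerLabel ?spouse ?spouseLabel WHERE {{
  ?player wdt:{OCCUPATION} wd:{FOOTBALL_PLAYER} ;
          wdt:{SPOUSE} ?spouse .
  ?spouse wdt:{OCCUPATION} wd:{SINGER} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


print(build_query())
```

The printed text can be pasted into the Wikidata Query Service. Notice that the query never mentions the word "football": it points at one specific concept, which is why there is no confusion with any other sport.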
And we can represent biological concepts in Wikidata too. I turned my focus to Wikidata and to how to harness its power for the life sciences. Some researchers were already starting to feed knowledge to Wikidata! Still in 2019, a preprint (now published in eLife) made it clear that Wikidata can be the knowledge graph for the life sciences, and that is revolutionary. I got more involved with Wikidata over time. I went to Cambridge to work with Wikidata at the eLife Sprint Hackathon, I got a PhD scholarship to study the concept of cell types on Wikidata, and Wikidata has really become a passion of mine.
In the meantime, I realized that Wikidata was just a new face to something going on for decades: ontologies! Yes, those very same ontologies that I disregarded as boring and useless. The Gene Ontology is only one of the hundreds of such conceptualization efforts. The OBO Foundry is a collection of hundreds of biomedical ontologies that represent concepts all over the life sciences spectrum. Literally thousands of people have worked on such ontologies, which are there, with rigorous concepts and their relations. Many of them have already been integrated into Wikidata. And still, the bulk of life sciences largely ignores biomedical ontologies.
That lack of broader use of ontologies is, unfortunately, not surprising. Understanding ontology articles requires a lot of specialized study. The papers in the Journal of Biomedical Semantics, one of the main venues for the area, are generally hard to read for outsiders. It takes time to understand even the surface. The most accessed article in that journal (a great article, by the way) summarizes its challenge as:
“Previously, multi-species anatomy ontologies contained a mixture of unique and overlapping content. This hampered integration and coordination due to the need to maintain cross-references or inter-ontology equivalence axioms to the single-species model organism anatomy ontologies, or to perform large-scale obsolescence and modular import”
Well, any topic has its jargon, and indeed, formalizations are hard. Taking part in current ontology projects requires a sizeable investment of time, which many scientists may not be willing to make at first. Plus, ontologists like precision a lot, and that makes it a challenge to write for outsiders. This communication barrier has kept the ontology and life sciences communities apart for decades.
Wikidata is a powerful bridge to walk from the life sciences bench to the ontology world. Wikidata lowers the barrier so smoothly that I took months to see that Wikidata and ontologies were two sides of the same coin. That smooth introduction got me over the energy barrier to contribute (small contributions so far) to formal ontologies of the OBO Foundry, such as the Cell Ontology.
Now I am completely in love with ontologies. I already was. We already are. All the time, what we do as scientists is turn the words from biomedical texts into concepts in our heads. We all know concepts: an astrocyte, a P53 protein, glycosylation, mitosis and so on. We establish relations between these concepts and can understand science even when the same things are named differently across articles.
We all even integrate knowledge we acquire in different languages! Bilinguals somehow merge ideas that come from completely different sentence structures, words, phonemes, and we match them to the same concepts in our head. We integrate, make logical assumptions, build hypotheses using information adjusted to our natural ontologies.
There is a lot of power in embracing our nature as ontologists. Just think about the scientists you admire the most, and how wonderful it would be to understand all the concepts that they have ever published. The concepts are already distributed in minds across the world, crying for integration. Matching the concepts we already know to a common, public database would make this integration seamless.
You are already an ontologist. Be calm; you do not have to learn a whole new area of study to harness this power. We are in the stone age of the use of formal biology concepts. Take a minute to think about the core concepts for your research that may have more than one name. Done: that was an ontological conceptualization. Go to Wikidata, explore it a little, and see how people have added concepts there. Take a look at the article on Wikidata and the life sciences. Open the OBO Foundry page and scroll a bit. Get to know the International Society for Biocuration. You are already part of all those efforts.
Ontologies are the exact opposite of boring and useless. Aligned with open science efforts and the public knowledge integration of Wikidata, ontologies are one of the most revolutionary tools of our generation. So be excited about it! Spread the word, support grants, read about and share these efforts. If we embrace ontologies for scientific communication, we can take life sciences to a whole new level.