Rita Sousa recently became a researcher at the University of Mannheim, working on the KI-DiabetesDetektion project. She holds a bachelor’s degree in health sciences and a master’s degree in bioinformatics and computational biology from the University of Lisbon. Her Ph.D. research has focused on integrating life sciences expertise with computational skills to develop new approaches to learning from complex biomedical data and discovering new knowledge.
What is your current research topic?
I am working on the KI-DiabetesDetektion project, which is funded by the German Federal Ministry of Education and Research and brings together partners from various sectors, including academia and industry. The project aims to apply machine learning methods to improve the early detection of diabetes. Due to the multidisciplinary nature of diabetes, several data sources, including patient and gene expression data, will be integrated into a knowledge graph. This knowledge graph will then be explored to identify patterns and potential risk factors linked to diabetes. Prototypes that combine knowledge graphs and machine learning will be evaluated through clinical studies.
For those who have not yet delved deeply into the topic of Data Science: How would you explain to a child what you are working on?
We are working on a project to help doctors detect a health condition called diabetes much earlier than they can right now. Diabetes is a severe condition where the body has trouble regulating sugar levels in the blood, and discovering it early can make a difference in how well doctors can treat it. To do this, we are gathering different types of information about people, like their medical history, how active they are, and even details about the genes they express. We put all this information together into a kind of organized map. Then, we use computer programs called “machine learning” algorithms to analyze this map and find hidden patterns or clues that could indicate someone might be at risk for diabetes in the future.
Everyone talks about Data Science – how would you describe the importance of the topic for yourself in three words?
Data empowers Knowledge.
What points of contact with Data Science does your work have? Which methods do you already use, and which would be interesting for you in the future?
So far, I have been working with diabetes-related gene expression datasets, since this type of data can help us understand the critical pathways and regulatory mechanisms of diabetes. Gene expression datasets are readily accessible in public databases and are powerful for identifying disease-associated genes, but their limited sample size is a challenge. This constraint hinders the effectiveness of supervised machine learning methods, which are data-driven and require large amounts of labeled data for effective training and performance. For this reason, we are developing an approach capable of integrating several gene expression datasets into a knowledge graph and enriching it with domain-specific knowledge. Embedding methods are used to learn patient representations, which are then fed to a classical machine learning method, such as a decision tree. In the future, since graph neural networks have gained substantial traction, we aim to investigate how these architectures, which are explicitly designed for graph structures, can be used.
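As a rough illustration of that pipeline, the sketch below links patients to genes and genes to pathways in a toy knowledge graph, turns each patient into a vector, and trains a very small classifier. It is not the project's implementation. All patient, gene, and pathway names, the labels, and the helper functions are hypothetical. Two simplifications are swapped in so the example runs with the Python standard library alone: a hand-built neighborhood count vector stands in for a learned graph embedding, and a decision stump (a decision tree of depth one) stands in for a full decision tree.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); every name is invented.
TRIPLES = [
    # Patient-level facts, as they might come from gene expression datasets.
    ("patient1", "overexpresses", "geneA"),
    ("patient2", "overexpresses", "geneA"),
    ("patient2", "overexpresses", "geneB"),
    ("patient3", "overexpresses", "geneC"),
    ("patient4", "overexpresses", "geneC"),
    ("patient4", "overexpresses", "geneD"),
    # Domain knowledge used to enrich the graph.
    ("geneA", "participates_in", "insulin_signaling"),
    ("geneB", "participates_in", "insulin_signaling"),
    ("geneC", "participates_in", "cell_cycle"),
    ("geneD", "participates_in", "cell_cycle"),
]
# Hypothetical labels: 1 = diabetes, 0 = control.
LABELS = {"patient1": 1, "patient2": 1, "patient3": 0, "patient4": 0}


def build_graph(triples):
    """Map each subject to the set of objects it points to."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
    return graph


def build_vocabulary(triples):
    """Sorted list of all object nodes; fixes the order of vector dimensions."""
    return sorted({obj for _s, _p, obj in triples})


def patient_vector(patient, graph, vocabulary):
    """Count how often each node is reached within two hops of the patient.

    Assumption: this simple count vector replaces a learned embedding.
    """
    counts = defaultdict(int)
    for gene in graph.get(patient, ()):
        counts[gene] += 1
        for concept in graph.get(gene, ()):
            counts[concept] += 1
    return [counts[node] for node in vocabulary]


def fit_stump(X, y):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for label_above in (0, 1):
                preds = [label_above if row[j] > t else 1 - label_above for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, label_above)
    _errors, j, t, label_above = best
    return j, t, label_above


def predict(stump, x):
    """Apply the fitted split to one patient vector."""
    j, t, label_above = stump
    return label_above if x[j] > t else 1 - label_above


graph = build_graph(TRIPLES)
vocabulary = build_vocabulary(TRIPLES)
patients = sorted(LABELS)
X = [patient_vector(p, graph, vocabulary) for p in patients]
y = [LABELS[p] for p in patients]
stump = fit_stump(X, y)

for p, x in zip(patients, X):
    print(p, x, "predicted:", predict(stump, x))
```

Because the pathway nodes are shared across genes, two patients with different overexpressed genes can still end up with similar vectors. That is the benefit the enrichment step is meant to provide when each individual dataset has few samples.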
How high is the value of Data Science for your work? Would your research even be possible without Data Science?
Data science is incredibly valuable for our project and, I dare say, important for most of the research done these days. Every step of the process, from data processing and integration to model training and validation, benefits from data science methods. Data science enables us to conduct meaningful research and develop accurate predictive models for diabetes risk assessment.
What development opportunities do you see for the topic of Data Science in relation to your field?
In the last few years, we have witnessed an increase in the amount of biological data being collected and accumulated due to advancements in existing technologies and the emergence of new ones. Given the exponential growth of biological data, there is an urgent need to develop efficient computational approaches to assist humans in extracting useful information, i.e., knowledge, from these vast and often complex volumes of data. However, applying these tools in real-world scenarios requires explainability, both to understand the mechanisms underlying the predicted natural phenomena and to distinguish between meaningful predictions and spurious correlations. Therefore, I believe that the development of explainable artificial intelligence approaches will gain traction in the future as a potential solution to ensure that algorithms and their predictions are human-understandable.