Welcome to the Pequeno Príncipe Big Data
Big Data in bioinformatics and georeferencing
Our five Big Data groups at the Pelé Pequeno Príncipe Research Institute are dedicated to exploring the intersection of Big Data with two fascinating fields: bioinformatics and georeferencing.
In this digital era, data has become an invaluable asset for researchers in biology, medicine, and industry. Since the publication of the Human Genome Project in 2001, researchers have profited from the growth of datasets in genomic databases. They have also explored the potential applications of georeferenced datasets in tackling epidemiological challenges in public health, using powerful registries such as those from the Centers for Disease Control and Prevention (CDC) and the Surveillance, Epidemiology, and End Results (SEER) Program of the U.S. National Cancer Institute.
Our goal in Big Data projects is to develop or use existing pipelines to mine data that helps us understand pediatric disease mechanisms and how they spread across different regions.
This project has the support of the Behring Foundation. We are grateful for this important partnership!

Bonald Cavalcante de Figueiredo
Scientific director of Pelé Pequeno Príncipe Research Institute

About us
The Pelé Pequeno Príncipe Research Institute (IPP, abbreviation in Portuguese) is part of the Pequeno Príncipe Complex, which includes Pequeno Príncipe Hospital, the largest and most comprehensive pediatric hospital in Brazil, and Pequeno Príncipe College, specialized in health training. It is located in Curitiba (Paraná state), Brazil.
Over the years, the Research Institute has focused on improving early and differential diagnoses, assisting in treatment, identifying the risks of disease recurrence, and increasing the chances of a cure for children and adolescents.
Another major contribution of the Institute is the training of professionals through the Master’s and PhD Program in Biotechnology Applied to Child and Adolescent Health, developed in partnership with Pequeno Príncipe College. Since its implementation, the program has graduated 180 master’s and PhD students.
Currently, the Research Institute has 15 main researchers and 97 ongoing research projects. In the last four years, the Institute has published 324 scientific articles.
The consolidation of an exclusive research unit was strengthened by its association with King Pelé. The Pelé Pequeno Príncipe Research Institute is the only social project in the world formally supported by Edson Arantes do Nascimento. Pelé lent his name because he believed in the Complex’s ability to transform the lives of children and adolescents.
What is Big Data?
Big Data is an environment for producing, collaborating, and sharing data aimed at scientific research. It is particularly important for research projects related to bioinformatics and georeferencing. Ideally, each specialty should have its specific database portfolio tailored to diseases of interest.
Big Data also functions as a teaching and training network, aligning with one of the core focuses of Pequeno Príncipe, which is dedicated to health assistance, teaching, and research.
As the field of bioinformatics continues to evolve, the significance of Big Data becomes increasingly apparent. The enormous amount of genetic information available in genomic databases presents both opportunities and challenges. By leveraging advanced data analysis techniques, researchers can extract valuable insights from these vast datasets, uncover patterns, identify genetic variations associated with diseases, accelerate drug discovery processes, and improve treatments.
Machine learning algorithms, for example, can be trained on massive genomic datasets to develop predictive models for disease diagnosis and prognosis. The integration of multiple datasets from diverse sources enables researchers to perform comprehensive analyses and generate detailed biological networks, shedding light on complex biological processes and interactions.
From 2001 to 2024, the volume of data increased exponentially.

Over time, new computing scripts have emerged. The most commonly used databases are mainly from the United States and some European countries. The following graph shows the exponential growth in data volume.

Big Data in GenBank

Our goal and methodology
Bioinformatics is a multidisciplinary field that combines biology, computer science, and statistics, and it has evolved rapidly due to the exponential growth of genomic data. The completion of the Human Genome Project (HGP) marked a major milestone, providing researchers with a reference sequence of the entire human genome. Genomic data volume is predicted to double every 7–12 months (Nature, volume 527, 2015). Since 2001, numerous genomic databases have emerged, serving as valuable resources for understanding genetic variations, disease mechanisms, and drug targets.
This monumental achievement of the HGP provided scientists with an extensive blueprint of the human genetic code, offering unprecedented insights into human biology and disease. Since then, genomic research has witnessed exponential growth, leading to the creation of numerous genomic databases housing vast amounts of genetic information.
These genomic databases, such as the U.S. National Center for Biotechnology Information (NCBI) GenBank and those maintained by the European Bioinformatics Institute (EBI), have become essential resources for researchers worldwide. They allow scientists to store, analyze, and share genomic data, facilitating groundbreaking discoveries and advancements in fields such as personalized medicine, genetic counseling, and evolutionary biology.
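To give a sense of what a doubling time of 7–12 months means in practice, the short sketch below converts a doubling time into a yearly growth factor. It assumes a constant doubling time, which is a simplification, and the function name `growth_factor` is only illustrative.

```python
def growth_factor(years, doubling_months):
    """Factor by which a quantity grows over `years`
    if it doubles every `doubling_months` months (assumed constant)."""
    return 2 ** (12 * years / doubling_months)


# Yearly growth at the two ends of the 7-12 month range quoted above.
for months in (7, 12):
    factor = growth_factor(1, months)
    print(f"doubling every {months} months: x{factor:.2f} per year")
```

Under this assumption, a 12-month doubling time means the data volume doubles each year, while a 7-month doubling time means it grows roughly 3.3-fold each year.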
Importance of using georeferencing
Georeferencing allows researchers to combine health and environmental data with geographic coordinates to indicate location on maps. In the fields of epidemiology and healthcare, georeferencing is crucial for analyzing the spatial patterns of diseases, identifying high-risk areas, and implementing effective interventions.
Using georeferenced epidemiological data, researchers can visualize the spatial distribution of disease rates and risk factors on maps. By applying statistical and artificial intelligence methods to the georeferenced data, researchers can identify geographical clusters and potential environmental factors contributing to disease outbreaks. This resource empowers public health collaborators to develop targeted strategies, efficiently allocate resources, and implement timely interventions to mitigate the spread of diseases.
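As a rough illustration of how georeferenced case records can be turned into a map-ready summary, the sketch below groups cases into square grid cells and flags the cells with many cases. This is simple grid binning, not a formal cluster-detection statistic, and it is not the Institute's own method. The coordinates, the cell size, the threshold, and the function names are all hypothetical.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.1):
    """Map a coordinate to the index of a square cell `cell_deg` degrees wide."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def hotspot_cells(cases, cell_deg=0.1, min_cases=3):
    """Count cases per grid cell and keep cells with at least `min_cases`."""
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in cases)
    return {cell: n for cell, n in counts.items() if n >= min_cases}


# Made-up (latitude, longitude) pairs: three close together, one far away.
cases = [
    (-25.43, -49.27),
    (-25.44, -49.25),
    (-25.41, -49.29),
    (-23.55, -46.63),
]
hotspots = hotspot_cells(cases)
print(hotspots)
```

In a real study, raw counts would normally be compared with the population at risk in each area before any area is called high-risk.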
Objectives
The objectives of the Pelé Pequeno Príncipe Research Institute in using Big Data are:
- To introduce students to concepts, practices, methods, and emerging technologies in the areas of bioinformatics, georeferencing, and computational biology. These activities are carried out in collaboration with Professor Mauro Castro, from the Federal University of Paraná.
- To develop or utilize available data mining pipelines to help researchers understand the mechanisms of pediatric diseases and how they spread across different regions.
- To contribute to the development of new diagnostic and treatment methods, as well as the discovery of new medicines through scientific studies.
- To improve the health and quality of life of children and adolescents, not only in Brazil but worldwide, through the results of research projects.
Technologies
Several tools are used in the development of scientific research involving Big Data. These include:
- Programming languages in biological data analysis;
- Algorithms and analysis flows applied to biological problems;
- Online generation, analysis, and treatment of biological data;
- Application of computational methods in the investigation of biological systems;
- Georeferencing systems; and
- Genomic and transcriptomic (related to RNA) databases.
Researchers from the Pelé Pequeno Príncipe Research Institute develop their own databases and also access data from various platforms, including free databases managed by other institutions and those developed within research centers. The most commonly used genomic and transcriptomic databases at the Pelé Pequeno Príncipe Research Institute include The Cancer Genome Atlas (TCGA)*, St. Jude Cloud (the St. Jude Children’s Research Hospital database), and the Research Institute’s own databases (WES and WGS).
*The Cancer Genome Atlas (TCGA) is a large project aimed at cataloging datasets on cancer mutations and RNA-Seq through genome sequencing and bioinformatics.
Partnerships
The five independent Big Data groups at the Pelé Pequeno Príncipe Research Institute share similar goals applied to various areas of biology, medicine, and spatial epidemiology, exploring datasets generated by the Institute’s own research or by authors located in other countries.
The Research Institute relies on important partners in the development of scientific studies related to Big Data, all of them nationally and internationally renowned. They include:
- Behring Foundation;
- Pequeno Príncipe Hospital;
- Pequeno Príncipe College, the educational unit of Pequeno Príncipe Complex;
- Federal University of Paraná (UFPR), Brazil;
- Institut de Pharmacologie Moléculaire et Cellulaire (IPMC), France;
- Thales Group, France;
- St. Jude Children’s Research Hospital, USA;
- Federal University of São Paulo, Brazil;
- U.S. Food and Drug Administration (FDA);
- U.S. National Institutes of Health (NIH);
- Georgetown University Medical Center, USA;
- COVID Human Genetic Effort, a global initiative;
- Paraná Sanitation Company (Sanepar), Brazil; and
- Oswaldo Cruz Foundation (Fiocruz), Brazil.
Network of connections
At the moment, Pelé Pequeno Príncipe Research Institute counts on five important partnerships.

The future of Big Data
Over time, the complexity of Big Data has increased, both in terms of speed and variability of information, which has been growing exponentially.
With so much information available, Big Data becomes essential: it stores the data and allows people to learn how to interpret and apply it. For example, gene sequencing data can help scientists understand where and how a disease emerged. That knowledge contributes to the early diagnosis of the disease and guides the best treatment for the patient.
What will be the future of Big Data? One thing is certain: data volumes will continue to increase. According to predictions from experts, besides the data volume increasing, machine learning will continue to change the landscape of scientific research; data scientists and chief data officers (CDOs) will be in high demand; data will be processed faster and faster; and actionable data will come to the forefront.
Over the next 25 years, we can expect an exponential increase in data, surpassing the amount of information that has emerged since the completion of the Human Genome Project in 2003. This will allow researchers to collect data faster and with greater precision.

Importance of Big Data
Big Data has made significant contributions to various scientific fields, including personalized medicine. It not only provides critical information for diagnosing diseases but also suggests possible therapies for patients based on data-driven insights.
In the area of genetics and biology, Big Data plays a crucial role in precision medicine. It enables researchers to quickly and accurately analyze the genetic makeup of many individuals, with or without a determined disease.
Terabytes of data (DNA, RNA, and proteins) are available for cross-referencing specific parameters related to each disease. In other words, there are improvements in volume, speed, and veracity.
These advancements make it possible to plan and develop therapies, including the creation of new medicines. The available data play a critical role in the design of cell and gene therapies.
In the specific case of the Pelé Pequeno Príncipe Research Institute, Big Data can be associated with innovations related to scientific bench studies across its seven lines of research, with projects focused on the diagnosis, prognosis, prevention, and treatment of diseases.
Ultimately, scientific research related to Big Data points to the need to establish small and large networks to accelerate more sustainable research, which is one of the goals of the Pequeno Príncipe Complex unit.
Big Data simplified
Learn more about some terms related to Big Data.
With information from the Brazilian Ministry of Management and Innovation in Public Services; the Secretariat of Digital Government, of the Brazilian Federal Government; IBM; Pontifical Catholic University of Paraná (PUCPR, abbreviation in Portuguese); the U.S. National Cancer Institute; the U.S. National Library of Medicine; and the American Medical Informatics Association (AMIA).
Algorithm
It is a set of instructions designed to achieve specific computational tasks, making it an essential element in computer programming. Algorithms can be used for various purposes, such as performing calculations or retrieving information from databases.
Artificial Intelligence (AI)
It is a field of computer science that aims to develop systems or algorithms capable of performing tasks that normally require human intelligence. AI uses techniques such as machine learning, natural language processing, and computer vision to acquire knowledge, learn from data, and make autonomous decisions. Its objective is to simulate the capacity for reasoning, adaptation, and problem-solving, contributing to advancements in areas such as automation, health, and transportation.
Big Data
A large volume of data, which includes not only structured data (such as tables), but also semi-structured or unstructured data, such as images, texts, and sounds. This data has the potential to be explored in an interrelated manner to obtain information. Given the complexity and volume of data, it requires substantial processing power.
Bioinformatics
It is a multidisciplinary field that combines biology, computer science, and statistics. It has evolved rapidly due to the exponential growth of genomic data. Bioinformatics tools can provide useful insights to answer some questions about diseases, including those that affect children and adolescents.
Data-driven
Data-driven processes guide decision-making and organizational planning through data utilization.
Data pipeline
A data pipeline is a system in which raw data from various sources are gathered, transformed, and transferred to a data storage space for analysis. Various types of data pipelines exist, each designed for specific activities, such as batch processing or streaming data.
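The three stages in this definition can be sketched as a small batch pipeline. The example below is a minimal illustration, assuming two hypothetical sources of gene read counts and an in-memory SQLite database as the storage space; the table name, field names, and function names are invented for the example.

```python
import sqlite3


def extract(sources):
    """Gather raw records from every source into one stream."""
    for source in sources:
        yield from source


def transform(record):
    """Normalize identifiers and convert read counts to integers."""
    return (
        record["sample_id"].strip().upper(),
        record["gene"].strip().upper(),
        int(record["reads"]),
    )


def load(rows, conn):
    """Store the cleaned rows in a table that is ready for analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expression "
        "(sample_id TEXT, gene TEXT, reads INTEGER)"
    )
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(sources, conn):
    """Run extract, transform, and load in sequence (batch style)."""
    load((transform(r) for r in extract(sources)), conn)


# Two made-up sources with inconsistent formatting.
source_a = [{"sample_id": " s1", "gene": "tp53", "reads": "120"}]
source_b = [{"sample_id": "S2 ", "gene": "TP53", "reads": 80}]

conn = sqlite3.connect(":memory:")
run_pipeline([source_a, source_b], conn)
total = conn.execute(
    "SELECT SUM(reads) FROM expression WHERE gene = 'TP53'"
).fetchone()[0]
print(total)
```

A streaming pipeline would apply the same stages to records as they arrive instead of to a complete batch.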
Data science
It is a multidisciplinary field that involves computational, statistical, and mathematical techniques. It aims to solve complex problems using large datasets.
Georeferencing
In epidemiology and healthcare, georeferencing plays a crucial role in analyzing disease spatial patterns, identifying high-risk areas, and implementing effective interventions. Researchers use georeferenced epidemiological data to visualize the spatial distribution of disease rates and risk factors on maps, identify geographical clusters, and examine potential environmental factors contributing to disease outbreaks.
Health informatics
Health informatics (HI), also known as medical informatics, applies principles of computer and information science to advance life sciences research, health professional education, public health, and patient care. It is a multidisciplinary field that has its focus on health technologies to improve human health and healthcare services. To put this into practice, it uses biomedical data and tools involving computer, cognitive, and social sciences.
Information Technology (IT)
It is a set of technological resources for obtaining, processing, and generating information that is made accessible through communication networks. Through the application of software development resources, it provides functionalities to the hardware, which, integrated into the communications system, offers services to society.
Internet of Things
The expression is used to designate advances in connectivity and interaction between various types of everyday objects equipped with sensors and internet communication capabilities. This creates a common virtual environment and allows remote control, the use of automatic commands, and even integration between them. This connectivity also generates a large amount of data in people’s daily lives.
LGPD
The Brazilian General Data Protection Law (LGPD, abbreviation in Portuguese), Law No. 13,709/2018, establishes rules for the collection, storage, processing, and sharing of personal data in the country, ensuring greater protection for all people.
Machine learning (ML)
Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to enable machines to imitate the way that humans learn. ML contributes to decision-making and model optimization processes, gradually improving its accuracy.
The Cancer Genome Atlas Program
The Cancer Genome Atlas (also known as TCGA) is a landmark cancer genomics program that molecularly characterized over 20,000 primary cancers and matched normal samples spanning 33 cancer types. Created in 2006, it has a multidisciplinary team made up of researchers from several institutions. This joint effort between the U.S. National Cancer Institute and the U.S. National Human Genome Research Institute has generated petabytes of genomic, epigenomic, transcriptomic, and proteomic data. These data have contributed to improved diagnosis and treatments and are available to the global research community.
Translational medicine
Translational medicine seeks to bring scientific discoveries into healthcare practice and provide feedback to research based on the main challenges that healthcare professionals face daily. In this way, it provides several clinical benefits, such as new forms of diagnosis and treatment; multidisciplinary knowledge between areas; greater accessibility to healthcare for the community; lower cost and greater effectiveness of medications, which also generates economic value; and support for the development of public policies.
The Big Data projects at the Pelé Pequeno Príncipe Research Institute benefit from the support of partners, investors, and the community. Check out who supports the scientific studies of the Pequeno Príncipe Complex unit. To all, we express our gratitude.
Investor

Partners












