What could we do with bio big data built from the clinical and genetic information of more than a million people? And what role do rare diseases play in bio big data?

Bio Big Data: Value and Limitations

Bio big data refers to the genetic and clinical information of large numbers of people. With such data, researchers can identify genetic variants associated with human phenotypes and diseases, and the identified variants can in turn provide insight into how to prevent and treat those diseases.

For this reason, bio big data has been actively collected and studied, and more than 300,000 associations between variants and phenotypes have been reported so far in more than 5,500 papers. [1]


Reported variants, GWAS catalog

However, bio big data collected through individual studies has limited value, because each study leaves some factors out of the data it collects.
Factors are left out for several reasons: they may not be required for the study, they may not have been considered important when the research began, or the budget may have been too limited to collect them. Whatever the reason, the missing factors reduce the usefulness of the data. For example, factors collected in diabetes studies may have limited applicability in studies of mental illness. And even if a new diabetes-related candidate gene is discovered, it cannot be assessed in pre-existing data that contains no information on that gene. In other words, there is a limit to how far the bio big data produced in one study can be used for research on other diseases.
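The reuse limit described above can be illustrated with a minimal sketch. The study names and field names below are hypothetical, chosen only for illustration; they are not taken from any real cohort.

```python
# Hypothetical field lists; real studies define their own variables.
DIABETES_STUDY_FIELDS = {"age", "sex", "bmi", "fasting_glucose", "hba1c"}
MENTAL_ILLNESS_STUDY_NEEDS = {"age", "sex", "sleep_hours", "phq9_score"}


def missing_fields(collected, required):
    """Return the required fields that the collected dataset lacks, sorted."""
    return sorted(set(required) - set(collected))


# Fields a mental illness study would need but the diabetes data never recorded.
gaps = missing_fields(DIABETES_STUDY_FIELDS, MENTAL_ILLNESS_STUDY_NEEDS)
print(gaps)
```

Any field that appears in the result would have to be collected again before the existing data could support the second study, which is the cost that broadly scoped collection tries to avoid.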

Bio Big Data Projects and Rare Diseases

National bio big data projects emerged against this background. These projects collect and manage vast amounts of genetic information, clinical information, and experimental samples at the national level, and they maximize the usability of the collected data by making it available to researchers.

Well-known national projects that have collected genetic information, clinical information, and experimental samples from hundreds of thousands of people include UK Biobank [2] in the UK, BioBank Japan [3] in Japan, and All of Us [4] in the United States.

These bio big data projects treat rare diseases as one of their important goals, because rare diseases are an area where such projects can quickly deliver direct results. This is because rare diseases are mainly caused by single genetic variants, unlike complex diseases such as diabetes and cardiovascular disease, which are affected by many genetic variants and by the environment. [5]

Therefore, the collected bio big data can be used to diagnose patients with rare diseases, and these diagnoses in turn allow a project to verify the quality of its data. In addition, improving the analysis methods during patient diagnosis and data verification increases the value of the collected bio big data.
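As a rough illustration of why single-variant causes make diagnosis a natural first use of such data, the sketch below matches each patient's variants against a table of known causal variants and reports the fraction of patients with a match. It is a deliberate simplification: every gene, variant, disease, and patient identifier is a placeholder, and real diagnosis also involves variant interpretation, inheritance patterns, and clinical review.

```python
# All identifiers below are placeholders, not real genes, variants, or diseases.
KNOWN_CAUSAL_VARIANTS = {
    "GENE_A:variant_1": "hypothetical disease X",
    "GENE_B:variant_7": "hypothetical disease Y",
}


def candidate_diagnoses(patient_variants, known=KNOWN_CAUSAL_VARIANTS):
    """Map each patient variant found in the known-causal table to its disease."""
    return {v: known[v] for v in patient_variants if v in known}


def diagnostic_yield(patients):
    """Fraction of patients carrying at least one known causal variant."""
    if not patients:
        return 0.0
    diagnosed = sum(
        1 for variants in patients.values() if candidate_diagnoses(variants)
    )
    return diagnosed / len(patients)


# A made-up cohort of four patients; only patient_1 carries a listed variant.
cohort = {
    "patient_1": ["GENE_A:variant_1", "GENE_C:variant_3"],
    "patient_2": ["GENE_C:variant_3"],
    "patient_3": [],
    "patient_4": ["GENE_D:variant_9"],
}
print(candidate_diagnoses(cohort["patient_1"]))
print(diagnostic_yield(cohort))
```

A yield computed this way could serve as one simple quality signal: if it were far lower than expected for a rare disease cohort, that might point to problems in sample handling, sequencing, or the variant table rather than in the patients.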


The genetic etiology of disease (https://www.nature.com/articles/ejhg2010249)

The UK's 100,000 Genomes Project is an example of a project that has reduced social costs and delivered clinical benefits by focusing on rare diseases. It is run by Genomics England, a company founded by the Department of Health and Social Care, the government department responsible for health and social care in the UK.

According to a recent paper from the project, diagnostic genetic analysis was conducted on 2,183 families, 4,660 people in total, affected by rare diseases, and 25% of patients received a diagnosis. The diagnoses ended these patients' diagnostic odysseys (median 75 months and 68 hospital visits) and the medical expenses that came with them ($122 million). In addition, 25% of the diagnosed patients experienced changes in clinical decisions as a result of the diagnosis, such as drug changes, clinical trial participation, and changes in clinical management. [6]


100,000 Genomes Pilot (https://www.nejm.org/doi/full/10.1056/NEJMoa2035790)

The National Project of Bio Big Data (KDNA): Goals and Status

The National Project of Bio Big Data (KDNA), which started in Korea, aims to collect clinical and genetic information from more than 1 million people and to build a research platform on that data.

The initial step of the project aims to collect samples from 5,000 rare disease patients and their families, for a total of 15,000 samples. There are two reasons for focusing on rare diseases in the first stage. First, the field of rare diseases requires national medical support and a large-scale approach. Second, analysis of bio big data can be highly effective in diagnosing and treating rare diseases.

Data collection and analysis guidelines will be established through this project, and clinician opinion and treatment support will be provided to participating patients.

The Future of National Bio Big Data Projects

The bio-industry, which includes medical and pharmaceutical technology, is a fast-growing, high-value industry that contributes to human health.

As bio big data accumulates and the analysis becomes more advanced, it will become possible to go beyond diagnosis to treatment and prevention. However, the collection of bio big data, the resource that forms the basis of the bio-industry, should take diversity and quality into account, not just quantity.

In addition, bio big data must be utilized efficiently and safely for the development of biotechnology. This requires effective analysis tools, detailed guidelines, and a system that gives companies and research institutions easy access to the data. At the same time, because bio big data deals with human information, data security will become an increasingly important issue as utilization grows.


  1. GWAS Catalog. https://www.ebi.ac.uk/gwas/home
  2. Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203-209.
  3. Nagai A, Hirata M, Kamatani Y, et al. Overview of the BioBank Japan Project: Study design and profile. J Epidemiol. 2017;27(3S):S2-S8.
  4. The All of Us Research Program Investigators. The "All of Us" Research Program. N Engl J Med. 2019;381:668-676.
  5. Becker F, van El C, Ibarreta D, et al. Genetic testing and common disorders in a public health framework: how to assess relevance and possibilities. Eur J Hum Genet. 2011;19:S6-S44.
  6. The 100,000 Genomes Project Pilot Investigators. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care — Preliminary Report. N Engl J Med. 2021;385:1868-1880.
