The term “big data” has become commonplace, and it comes in many forms: big data in finance, such as customer gender and purchase times used to tailor marketing; big data in traffic management, such as city accident counts used to improve road safety; and big data in healthcare, such as medical images from which automated image-analysis systems learn to detect lesions and support medical decision-making.

Where is the largest amount of data produced in our daily lives?

YouTube is the world's largest video production and consumption platform. How much data is produced on YouTube? By most estimates, about 1 to 2 EB (exabytes) of data are produced on YouTube per year. 1 EB equals 1,000 PB (petabytes), or 1,000,000 TB (terabytes). Numbers this large are hard to grasp, but they are certainly enormous.

Yet the largest big data around us is genomic data.

Genomic data refers to the genetic information encoded in the roughly 3 billion bases of the human genome. Genomes are sequenced for many purposes: to prescribe personalized treatments, to assess the genetic risk of disease, or to diagnose rare genetic diseases. The amount of genomic data produced this way is estimated at about 220 million genomes, or 40 exabytes, per year, roughly 20 to 40 times the YouTube figure above.

In 2003, the Human Genome Project assembled the roughly 3 billion puzzle pieces that make up the human genome. Humanity pictured a rosy future in which all diseases could be conquered, but decoding the secret of each of those 3 billion pieces turned out not to be easy.

With the rapid development of sequencing technology, the cost of reading a person's genome has fallen dramatically, and an entire genome can now be sequenced for about $1,000. However, fast and cheap sequencing by itself was not the answer. Twenty years have passed since the Human Genome Project ended, but problems still remain. What are they?

1. Computation time

Reading genomic data is only the beginning. The data then has to be processed.

Current sequencers cannot read the genome from position 1 to the end in one pass. Instead, the genome is cut into short fragments of about 400 bp, and both ends of each fragment are read at a length of about 150 bp. These reads are stored in FASTQ files. [Figure 1] The reads are then reassembled into long sequences by aligning them against the reference sequence; the resulting files are called BAMs. [Figure 2] From the alignment, the positions among the 3 billion in the genome that differ from the reference sequence, called variants, can be identified and written to a VCF file. [Figure 3]

Reading all 3 billion bases of the genome is called WGS (Whole Genome Sequencing), reading all protein-coding exons is called WES (Whole Exome Sequencing), and reading only specific gene positions is called target sequencing. When only a specific region is sequenced, variant calling covers only that region, and the files that describe these target regions are called BED files. [Figure 4]

A WGS VCF file contains about 5 million variants per person, roughly 0.16% of the 3 billion bp. Although they are all called variants, some are outwardly visible and harmless, such as those behind skin or hair color, while others cause disease and make life difficult.

[Figure 1] A FASTQ file opened as text. One read consists of four lines: header, sequence, delimiter, and quality. The yield of WGS at 30x is about 10 billion bp.
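
To make the four-line record structure concrete, here is a minimal Python sketch that walks a FASTQ file and tallies reads and bases. The file name `sample.fastq` is a placeholder, not a file from this post.

```python
# Minimal sketch: iterate over FASTQ records (4 lines each: header, sequence,
# delimiter, quality) and tally reads and bases. "sample.fastq" is a placeholder.
def read_fastq(path):
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:                      # end of file
                break
            sequence = fh.readline().rstrip()
            delimiter = fh.readline().rstrip()  # usually just "+"
            quality = fh.readline().rstrip()
            yield header, sequence, quality

reads = 0
bases = 0
for header, sequence, quality in read_fastq("sample.fastq"):
    reads += 1
    bases += len(sequence)

# For WGS at 30x, the total base count is on the order of 10 billion bp.
print(f"{reads} reads, {bases} bases")
```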


[Figure 2] A BAM file viewed in the IGV viewer. Each read is stacked against the reference sequence. In the blue tracks at the bottom, the upper track shows the exons and introns of the *MEN1* gene from the RefSeq Gene annotation, and the lower track shows the BED region carefully designed for each exon.
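
For readers who want to inspect a region of a BAM programmatically rather than in IGV, a rough sketch using the pysam library (assuming it is installed and the BAM is indexed) might look like the following. The path and coordinates are illustrative placeholders, not the actual *MEN1* design.

```python
# Sketch: count the reads stacked over a region of an indexed BAM with pysam,
# similar to what the IGV pile-up shows. Path and coordinates are placeholders.
import pysam

bam = pysam.AlignmentFile("sample.bam", "rb")   # requires a .bai index
# Placeholder region; substitute the real target coordinates from your BED file.
chrom, start, end = "chr11", 64_570_000, 64_580_000
n_reads = bam.count(chrom, start, end)
print(f"{n_reads} reads overlap {chrom}:{start}-{end}")
bam.close()
```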


[Figure 3] Example of the VCF file format. It contains metadata lines, a header line, and variant data lines.
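
A minimal sketch of reading such a file in Python, using the placeholder path `sample.vcf`, could look like this:

```python
# Minimal sketch of reading a VCF: "##" lines are metadata, the single "#CHROM"
# line is the column header, and each remaining line is a tab-delimited variant.
variants = []
with open("sample.vcf") as fh:
    for line in fh:
        if line.startswith("#"):
            continue  # skip metadata and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alt = fields[:5]
        variants.append((chrom, int(pos), ref, alt))

# A WGS VCF typically holds on the order of 5 million variants.
print(f"{len(variants)} variants loaded")
```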


[Figure 4] Example of the BED file format. Each line has a chromosome, start, and end column. In this format, regions of the genome can be described.
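
Building on the VCF sketch above, a hypothetical on-target filter that keeps only the variants falling inside BED regions might look like this. Note that BED coordinates are 0-based and half-open, while VCF positions are 1-based; `targets.bed` is a placeholder file name.

```python
# Sketch: keep only variants whose positions fall inside the BED target regions.
# `variants` is the list built in the VCF sketch above; "targets.bed" is a placeholder.
regions = []
with open("targets.bed") as fh:
    for line in fh:
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.append((chrom, int(start), int(end)))

def in_targets(chrom, pos):
    # Convert the 1-based VCF position to 0-based before comparing.
    return any(c == chrom and start <= pos - 1 < end for c, start, end in regions)

on_target = [v for v in variants if in_targets(v[0], v[1])]
print(f"{len(on_target)} of {len(variants)} variants fall in the BED regions")
```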


Going from FASTQ, the raw data of a whole genome, through BAM to a VCF takes at least about a day, 24 hours, on server-grade hardware. The analysis can also run on an ordinary PC, but it requires at least 16 GB of memory and takes about two weeks.

Analyzing exome sequencing data, which covers only the protein-coding exon regions rather than the whole genome, takes about 2 hours.

2. Storage

When a person's genome is sequenced and variant information is extracted, the FASTQ, BAM, and VCF files discussed earlier are produced. Reading a person's 3 billion bases at least 30 times means, by simple arithmetic, about 90 billion characters. [Table 1] shows what this means in terms of file size.

| Type (mean depth) | FASTQ | BAM | VCF | Sum |
|---|---|---|---|---|
| WES (100x) | 5 GB | 8 GB | 0.1 GB | 13 GB |
| WGS (30x) | 80 GB | 100 GB | 1 GB | 180 GB |

[Table 1] Size of WES and WGS genome data of one person.


In general, a movie with a running time of 135 minutes is about 3 GB, so one person's genome is roughly the size of 60 films. What about storing it? As of early 2022, a 1 TB external hard disk costs about $70 and holds the genome data of about 5 people. What about the cloud? AWS (Amazon Web Services), which is widely used, charges about $0.025/GB per month. One person's genome then costs about $4.5/month, but for 1,000 people, roughly the minimum number of samples in a population needed to filter out common variants, simple storage already runs $4,500/month. [Table 2]

| Provider | Service name | Price | Price per one WGS sample |
|---|---|---|---|
| AWS | S3 Standard | $0.025/GB | $4.5 |
| Google Cloud | Cloud Storage | $0.023/GB | $4.14 |
| Microsoft Azure | Premium | $0.15/GB | $27.0 |

[Table 2] Cloud data storage service prices [1][2][3]
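
As a back-of-the-envelope check of these numbers, a short Python snippet using the Table 2 list prices and the 180 GB per-sample size:

```python
# Monthly storage cost for one 180 GB WGS sample (FASTQ + BAM + VCF) and for a
# 1,000-sample cohort, at the list prices quoted in Table 2.
wgs_size_gb = 180
price_per_gb = {          # USD per GB-month
    "AWS S3 Standard": 0.025,
    "Google Cloud Storage": 0.023,
    "Azure Premium": 0.15,
}
for service, price in price_per_gb.items():
    per_sample = price * wgs_size_gb
    print(f"{service}: ${per_sample:.2f}/month per sample, "
          f"${per_sample * 1000:,.0f}/month for 1,000 samples")
```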


3. Data transfer

When we talk about a fast connection, we usually mean gigabit internet, about 1 Gb (gigabit) per second, or 125 MB/s in the bytes we are more familiar with. That is the theoretical maximum; for convenience, let's assume it is actually reached and estimate the transfer time for the genome data above. With FASTQ + BAM + VCF = 180 GB, the transfer takes 180 GB / 125 MB/s = 1,440 seconds = 24 minutes. At first glance that seems short. In practice, however, the networks we use throttle heavy traffic: a rate limit (QoS) kicks in after roughly 100 GB, depending on the plan, after which speed drops to 100 Mb/s, one tenth of gigabit internet. The first sample then takes 100 GB / 125 MB/s + 80 GB / 12.5 MB/s = 2 hours, and any additional sample must be received entirely at 100 Mb/s, taking 180 GB / 12.5 MB/s = 4 hours. Sending is just as much of a problem as receiving; 180 GB is nothing like a document attached to an e-mail. Sharing such data with collaborators is so cumbersome that, in practice, it is often copied onto an external hard drive and shipped.
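
The arithmetic above can be captured in a few lines of Python, assuming a 100 GB cap before throttling and 1 GB = 1,000 MB:

```python
# Transfer-time estimate: gigabit internet is 125 MB/s at best, throttled to
# 12.5 MB/s (100 Mb/s) once a 100 GB traffic cap is exhausted.
FULL_SPEED_MB_S = 125.0      # 1 Gb/s
THROTTLED_MB_S = 12.5        # 100 Mb/s
CAP_GB = 100

def transfer_hours(size_gb, cap_remaining_gb=CAP_GB):
    fast_gb = min(size_gb, cap_remaining_gb)
    slow_gb = size_gb - fast_gb
    seconds = fast_gb * 1000 / FULL_SPEED_MB_S + slow_gb * 1000 / THROTTLED_MB_S
    return seconds / 3600

print(f"First sample (180 GB): {transfer_hours(180):.1f} h")     # ~2 h
print(f"Next sample, cap used: {transfer_hours(180, 0):.1f} h")  # ~4 h
```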

4. Variant interpretation

Reading, storing, and transferring genome data is hard enough, but the most important step, analyzing it, still remains. Processing raw data on a personal PC is difficult or impossible. Better server specifications make the analysis faster, but in general, with a 40-thread CPU and 250 GB of RAM, producing variant data from whole genome data takes about 24 hours. With cloud computing, the cost works out as shown in [Table 3]. The VCF itself contains about 5 million variants, and among them the variants related to the patient's disease must be identified in light of the clinical information.

| Provider | Service name* | CPU (threads) | RAM (GB) | Price | Price per one WGS sample analysis (24 h)** |
|---|---|---|---|---|---|
| AWS | r5.8xlarge | 32 | 256 | $2.016/h | $48.384 |
| Google Cloud | c2-standard-60 | 60 | 240 | $2.51/h | $60.24 |
| Microsoft Azure | E32a v4 | 32 | 256 | $3.712/h | $89.088 |

[Table 3] Cost of WGS Data Variant Extraction Using Cloud Computing [1][2][3]


  \* Google Cloud measured with an instance of similar specs to those supporting 256 GB RAM
  \*\* Assumes that one WGS sample is processed in 24 hours
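
The per-sample figures in Table 3 follow directly from the hourly prices and the 24-hour assumption:

```python
# Reproducing the Table 3 estimate: cost of one WGS variant-calling run, assuming
# it occupies a single instance for 24 hours at the on-demand list prices quoted.
hourly_price = {                 # USD per hour
    "AWS r5.8xlarge": 2.016,
    "Google Cloud c2-standard-60": 2.51,
    "Azure E32a v4": 3.712,
}
HOURS_PER_SAMPLE = 24
for instance, price in hourly_price.items():
    print(f"{instance}: ${price * HOURS_PER_SAMPLE:.2f} per WGS sample")
```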

Conclusion

As we have seen, handling this much data in a small computing environment comes with many difficulties. Anyone who has received genomic data from a sequencing company knows that it is not easy to find meaningful information in the raw data itself. For interpretation, therefore, we recommend receiving a report from a company with extensive experience handling such data, a database updated with the latest information every day, and a finely tuned diagnostic algorithm (e.g., 3billion). That way, you can focus your efforts on the few variants extracted by 3billion and put an end to the diagnostic odyssey.

Still, for readers who would like to look at raw genomic data themselves, a future post will go through the raw data files most frequently encountered in genomic data analysis and examine their formats and data structures.

References

  1. AWS. AWS Pricing Calculator
  2. Google Cloud. Google Cloud Pricing Calculator
  3. Microsoft Azure. Pricing Calculator