Natural Language Processing and Text Mining for Interpretation of Genetic Variants

Insights | 23. 01. 11

Introduction

How many minutes does it usually take you to read a news article, a blog post, or the abstract of a research paper in a journal such as Cell or Nature? If you are an expert in the field, it may take only a minute; if you are not, it may take more than 4 minutes. Similarly, when interpreting genetic variants, it may take a clinician 1 to 4 minutes to find and read a publication about a single variant. Whole exome sequencing (WES) finds about 70,000 to 100,000 variants in one patient. How long would it take to read a publication for each of them? Even at one publication every two minutes, 70,000 variants would take about 97 days of nonstop reading. A clinician who did nothing but read publications would have no time left for clinical treatment and would see only 3 to 4 patients a year. Determining the pathogenicity of a patient’s genetic variants, and whether they are associated with the patient’s phenotype, by reading the free text of publications therefore takes far too much time. How can we address this problem technically?
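The reading-time estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one publication per variant and reading around the clock, and the function and constant names are chosen for illustration.

```python
# Back-of-the-envelope estimate of reading time per patient.
# Assumptions (not measurements): one publication per variant,
# two minutes per publication, reading 24 hours a day.
MINUTES_PER_PAPER = 2
MINUTES_PER_DAY = 24 * 60


def reading_days(n_variants, minutes_per_paper=MINUTES_PER_PAPER):
    """Days of nonstop reading needed to cover one paper per variant."""
    return n_variants * minutes_per_paper / MINUTES_PER_DAY


low = reading_days(70_000)    # lower end of the WES variant count
high = reading_days(100_000)  # upper end of the WES variant count
patients_per_year = 365 / low

print(f"{low:.0f} to {high:.0f} days; about {patients_per_year:.1f} patients per year")
# prints "97 to 139 days; about 3.8 patients per year"
```

The lower bound matches the 97 days quoted above, and dividing a year by that figure gives the 3 to 4 patients a year.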

1. Velocity and volume of data related to NLP for human genetic variants

MEDLINE (a medical information online database) contains as many as 33 million biomedical articles. The number of PubMed and PubMed Central (PMC) publications regarding genetic variants is increasing every year, and there are 3,000 new publications every day. Returning to the example above, if a clinician finds a mutation in the TBR1 gene in a patient with intellectual developmental disorder with autism and speech delay, a PubMed search for TBR1 retrieves 324 papers (as of 1/11/2021). In addition, more and more research papers are being published as the cost of genetic testing decreases. How can we keep up with the latest information? Natural language processing may solve these problems.
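A search like the TBR1 example can also be issued programmatically, which is usually the first step before any text mining. The sketch below uses the NCBI E-utilities ESearch endpoint with JSON output; the helper names (build_esearch_url, parse_esearch, search_pubmed) are hypothetical, and the count returned today will differ from the figure quoted above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint (returns matching PubMed IDs).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL that asks PubMed for `term` as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_esearch(payload):
    """Extract (total hit count, list of PubMed IDs) from an ESearch JSON body."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search_pubmed(term, retmax=20):
    """Run the search over the network and return (count, ids)."""
    with urlopen(build_esearch_url(term, retmax)) as response:
        return parse_esearch(response.read().decode("utf-8"))


# Usage (needs network access, so it is left as a comment):
# count, pmids = search_pubmed("TBR1")
```

Keeping URL construction and response parsing separate from the network call makes the two pure functions easy to check without contacting NCBI.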

2. Concept of NLP and text mining

Natural language refers to human-readable language written without any formal structure, and natural language processing (NLP) refers to the techniques for processing it. Text mining is considered a subfield of data mining in which the data to be processed is text; it is the process of obtaining the information we want from text. Natural language processing is one of the techniques used in text mining, and the two terms are often used interchangeably. Both continue to advance, and both are well-established fields that are widely used across industries. Recently, many studies have combined NLP with bioinformatics or translational bioinformatics, and the area has attracted growing attention because of the opportunities it offers for expanding research¹.

3. What are the opportunities from NLP?

NLP is an essential technique for extracting and processing the natural-language information that researchers want from various databases. Depending on the databases a researcher intends to use and the purpose of the research, there are many opportunities to improve biomedical informatics. Below are two examples of the many opportunities for NLP in the biomedical field.

1) Improving information retrieval and extraction

Natural language processing can normalize non-standard notations for genetic variants, diseases, phenotypes, and compounds into standard nomenclature, leading to more efficient searches. Because natural language is written by humans, different authors may express the same concept differently (inter-writer variability), and a single author may write the same concept in several different ways (within-writer variability). To address this issue, ontologies and nomenclatures have been developed to define each concept (e.g., the name of a genetic variant, a disease, or a phenotype). When these ontologies and nomenclatures are not used, however, the same concept can be expressed in many different ways, and the information you are looking for may not be found because the keywords do not match. In 25% of cases, mutations are not expressed in a standardized form (e.g., RSID, HGVS), so researchers may miss the mutation they are searching for (Figure 1)². This is because the same variant can be written in various notations, as shown in the table below (Table 1).

Figure 1. Number of publications that contain genomic variants in PubMed and PMC

Type	Example	Count	Proportion (%)
WildNumberMutation	V600E	1,107,672	44.95
RSID	rs113488022	611,733	24.82
c.NumberAmino>Amino	c.1799T>A	135,510	5.50
p.WildNumberMutation	p.V600E	105,921	4.30
NumberAmino>Amino	1799T>A	98,228	3.21

Table 1. Top 5 genomic variant notations found in the body text of literature in PubMed and PMC open access (reference: Lee et al. (2021))

To address this problem, NLP can help identify the main notation patterns, or recognize a given notation as a variant, through a technique such as Named Entity Recognition (NER). Frequently observed notation patterns can also be used as rules. A good example of such work is PubTator, which uses various AI models including BERT (Bidirectional Encoder Representations from Transformers)³. It is a web service that recognizes the concepts behind words in documents (e.g., gene, disease, chemical, mutation, species, cell line). Similarly, several other tools have been developed to extract information, such as PhenoTagger, which recognizes the names of individual phenotypes.
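The idea of using frequent notation patterns as rules can be sketched with regular expressions. The rule set below is a hypothetical illustration that covers only the five notation types in Table 1, with single-letter amino acid codes; it is not how PubTator or any trained NER model works, and the label names are assumptions made for this example.

```python
import re

# Hypothetical rule set: one regular expression per notation type in Table 1.
# Order matters: prefixed forms are tried before the bare forms.
NOTATION_RULES = [
    ("rsid", re.compile(r"rs\d+", re.IGNORECASE)),                # rs113488022
    ("coding_dna_substitution", re.compile(r"c\.\d+[ACGT]>[ACGT]")),  # c.1799T>A
    ("protein_substitution", re.compile(r"p\.[A-Z]\d+[A-Z]")),    # p.V600E
    ("bare_dna_substitution", re.compile(r"\d+[ACGT]>[ACGT]")),   # 1799T>A
    ("bare_protein_substitution", re.compile(r"[A-Z]\d+[A-Z]")),  # V600E
]


def classify_notation(token):
    """Return the label of the first rule that matches the whole token, else None."""
    for label, pattern in NOTATION_RULES:
        if pattern.fullmatch(token):
            return label
    return None


def tag_variants(text):
    """Return (mention, label) pairs for every token that looks like a variant."""
    mentions = []
    for token in re.findall(r"[\w.>]+", text):
        token = token.rstrip(".")  # drop a sentence-final period
        label = classify_notation(token)
        if label is not None:
            mentions.append((token, label))
    return mentions


print(tag_variants("The BRAF V600E mutation (c.1799T>A, rs113488022) was found."))
# prints [('V600E', 'bare_protein_substitution'),
#         ('c.1799T>A', 'coding_dna_substitution'), ('rs113488022', 'rsid')]
```

Rules like these are cheap and transparent, but they miss notations outside the listed patterns (for example three-letter amino acid codes) and can mislabel unrelated tokens that happen to fit a pattern, which is one reason learned NER models are used alongside them.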

Figure 2. Example of PubTator use

2) CDSS (Clinical Decision Support Systems) for variant interpretation

Interpreting genetic variants requires matching the patient’s phenotype and genotype against those described in case reports. It is therefore important to understand the mutations and symptoms reported in previous literature, but this requires labor-intensive effort. To reduce that effort, HGMD (the Human Gene Mutation Database), a database curated by humans, was developed. However, manual curation alone cannot keep up with every rare disease and with the new research papers published every day.

For clinical application, it would be more efficient for a clinician to read only the literature that is highly related to the patient’s variant, rather than looking at all of the literature. It is similar to being recommended the AI speakers that fit a customer’s preferences rather than searching through hundreds of AI speakers on Amazon and comparing product specifications one by one. AMELIE is an attempt of this kind in biomedical research⁴. It takes the patient’s VCF (variant call format) file and symptoms as input, finds papers that describe symptoms similar to the patient’s for the variants in the VCF, and orders the papers by priority. This is a good example of how natural language processing can reduce the burden on clinicians.
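The ranking idea described above can be illustrated with a toy example. This is not AMELIE's actual method: it is a minimal sketch that assumes genes and phenotype terms have already been extracted from both the patient's data and the papers, and it scores each paper by the Jaccard overlap of phenotype sets. The Paper class, the paper IDs, and the GENE_X / GENE_Y records are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    gene: str
    phenotypes: frozenset


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_papers(candidate_genes, patient_phenotypes, papers):
    """Keep papers about the patient's candidate genes, best phenotype match first."""
    scored = [
        (paper.paper_id, jaccard(patient_phenotypes, paper.phenotypes))
        for paper in papers
        if paper.gene in candidate_genes
    ]
    # Sort by descending score, then by ID so ties are ordered deterministically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy records, invented for this example.
papers = [
    Paper("paper_A", "TBR1", frozenset({"intellectual disability", "autism", "speech delay"})),
    Paper("paper_B", "TBR1", frozenset({"autism", "seizures"})),
    Paper("paper_C", "GENE_Y", frozenset({"seizures"})),
    Paper("paper_D", "GENE_X", frozenset({"short stature"})),
]
candidate_genes = {"TBR1", "GENE_X"}  # genes carrying variants in the patient's VCF
patient_phenotypes = {"intellectual disability", "autism", "speech delay"}

ranking = rank_papers(candidate_genes, patient_phenotypes, papers)
print(ranking)
# prints [('paper_A', 1.0), ('paper_B', 0.25), ('paper_D', 0.0)]
```

A real system would map free-text symptoms to ontology terms and use a learned relevance model, but the structure is the same: filter by the patient's genes, score by phenotype similarity, and show the clinician the top of the list.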

In addition to these examples, there may be more opportunities depending on data, method, and purpose. Therefore, an accurate understanding of the technology and a clear definition of the research purpose will provide more diverse opportunities for natural language processing.

4. Challenges

With the recent development of artificial neural networks, NLP has evolved into language models that can learn and process words, sentences, or documents, and even the semantic representation of words. Since BERT was released in 2018, many domain-specific language models (BioBERT, PubMedBERT, etc.) have also been developed for bioinformatics. However, many challenges and limitations remain, and many studies are being conducted to resolve them. Below are some of these limitations and the related research on NLP.

Recent studies mainly rely on AI for natural language processing, so they inherit the limitations of AI itself.

1) Labeled data is still needed, and it is expensive to produce. Supervised learning algorithms require labeled data, such as genomic data paired with a diagnosis code, and sharing labeled patient data raises privacy problems. For rare diseases in particular, it is difficult to obtain a language model that has learned from a large amount of rare disease data. To address this, privacy-preserving AI and few-shot learning are being studied.

2) A general model shows lower performance than a task-specific model. To address this, research on task-agnostic biomedical language models is being conducted.

3) It is still difficult to explain “why this result came out” in natural language processing that uses artificial neural networks⁵. Erroneous predictions are especially risky and expensive in this field, so it is important that end users such as patients and clinicians can interpret and understand the AI system. For this reason, eXplainable AI (XAI) is being actively researched. Indeed, from the standpoint of pattern recognition, trusting an unexplained AI classification of a breast cancer pathology image is not much different from trusting a classification made by pigeons (Figure 3). XAI ultimately aims at responsible AI: it should make the internal algorithm transparent and provide a rationale that users can fully accept.

Figure 3. Experiment to classify images of malignant and benign tumors in pathological images of breast cancer using pigeons ⁶

5. Conclusion

We introduced NLP as a way to process human-readable text data by machine. As the cost of sequencing has dropped, new studies on genetic mutations have been published at an increasing pace, and the volume of publications has exceeded what humans can process. Many researchers have therefore started focusing on computer-based processing methods. The value that NLP can provide depends on what kind of problem you are facing and how you want to solve it. Beyond the two examples introduced here, natural language processing offers many more benefits. Because recent natural language processing relies on artificial intelligence, it inherits the limitations of artificial intelligence. Nevertheless, these technologies are expected to contribute to the biomedical sciences and accelerate the progress of related technologies.

References

Cohen, K. B., & Hunter, L. E. (2013). Chapter 16: text mining for translational bioinformatics. PLoS computational biology, 9(4), e1003044.
Lee, K., Wei, C. H., & Lu, Z. (2021). Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Briefings in bioinformatics, 22(3), bbaa142.
Wei, C. H., Allot, A., Leaman, R., & Lu, Z. (2019). PubTator central: automated concept annotation for biomedical full text articles. Nucleic acids research, 47(W1), W587-W593.
Birgmeier, J., Haeussler, M., Deisseroth, C. A., Steinberg, E. H., Jagadeesh, K. A., Ratner, A. J., … & Bejerano, G. (2020). AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Science translational medicine, 12(544).
Ahmad, M., Eckert, C., & Teredesai, A. (2018). Interpretable machine learning in health care. In Proceedings of the ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Washington, DC, August 29 – September 1 (pp. 559-560).
Levenson, R. M., Krupinski, E. A., Navarro, V. M., & Wasserman, E. A. (2015). Pigeons (Columba livia) as trainable observers of pathology and radiology breast cancer images. PloS one, 10(11), e0141357.