SEOUL, SOUTH KOREA—3billion has announced 3Cnet, a pathogenicity prediction tool based on recurrent neural networks (RNNs).
3Cnet achieves a top-1 recall of 14.5% (approximately 2.2 times REVEL's leading top-1 recall of 6.5%) when retrieving the causal pathogenic variant in 111 patient cases diagnosed according to the ACMG guidelines.
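To make the metric concrete, here is a small sketch of how top-1 recall could be computed over patient cases: a case counts as a hit when the diagnosed causal variant receives the highest predicted score among that patient's candidate variants. The function name and data layout are illustrative assumptions, not 3Cnet's actual evaluation code.

```python
# Hypothetical sketch: top-1 recall over a set of patient cases.
def top1_recall(cases):
    """cases: list of (scores, causal_index) pairs, where `scores` holds
    predicted pathogenicity scores for one patient's candidate variants
    and `causal_index` points at the diagnosed causal variant."""
    hits = sum(
        1 for scores, causal in cases
        if causal == max(range(len(scores)), key=scores.__getitem__)
    )
    return hits / len(cases)

# Toy example: the causal variant is ranked first in two of three cases.
cases = [
    ([0.9, 0.2, 0.4], 0),
    ([0.3, 0.8, 0.1], 0),
    ([0.1, 0.2, 0.7], 2),
]
print(round(top1_recall(cases), 2))  # 0.67
```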
3Cnet's fundamental approach utilizes protein sequences and related characteristics, which it processes with long short-term memory (LSTM) layers in its architecture.
Performance was additionally verified on ClinVar data newly uploaded after training. 3Cnet performs well not only in relative comparisons against similar algorithms, but also in forecasting revisions to ClinVar variant pathogenicity classifications.
3Cnet is named for the "three C's" of training data used to create the model.
- Conservation data: 3Cnet uses multiple sequence alignments (UniRef30) to generate simulated mutation data. When seeking data to compare against, 3Cnet looks towards nature itself.
  - Simulated variants whose substitutions are well represented in the MSA results are labeled as benign; those that are not are labeled as pathogenic.
  - These simulated variants were then used to train 3Cnet: over 2,500,000 simulated pathogenic variants and over 1,500,000 simulated benign variants.
- Clinical data: derived from the ClinVar database.
- Common variants: derived from the gnomAD database. Variants with high allele frequencies (variants commonly found in the population) are labeled as benign.
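The conservation-based labeling above can be illustrated with a toy heuristic: look up how often the substituted amino acid appears in the MSA column for that position. The threshold, data shapes, and function name here are assumptions for illustration, not 3Cnet's actual pipeline.

```python
# Illustrative sketch (not 3Cnet's real code): label simulated missense
# variants from amino-acid frequencies observed in an MSA column.

FREQ_THRESHOLD = 0.05  # assumed cutoff for "observed among homologs"

def label_variant(msa_column_freqs, alt_aa):
    """msa_column_freqs: dict mapping amino acid -> frequency in the MSA
    column at this position. A substitution frequently seen among
    homologs is treated as benign-like; one rarely or never seen at a
    conserved position is treated as pathogenic-like."""
    if msa_column_freqs.get(alt_aa, 0.0) >= FREQ_THRESHOLD:
        return "benign-like"
    return "pathogenic-like"

# Toy column: a position dominated by leucine, with valine tolerated.
column = {"L": 0.80, "V": 0.15, "I": 0.05}
print(label_variant(column, "V"))  # benign-like
print(label_variant(column, "W"))  # pathogenic-like
```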
A distinctive feature of 3Cnet is its use of multi-task learning. Although the conservation-based training data are not validated against clinical cases, the mutations simulated from evolutionary, non-human data implicitly capture patterns universal to amino acid chains that are relevant to the task at hand.
By designing 3Cnet with some weights shared across tasks and others partitioned per task, the model learns the universal patterns within the shared layers while capturing human-specific insights within its partitioned layers.
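The shared/partitioned design can be sketched minimally: one shared layer feeds two task-specific heads, one scored on clinical labels and one on conservation labels. This is an assumption-laden toy in NumPy with dense layers standing in for 3Cnet's LSTM stack; all names and dimensions are invented for illustration.

```python
import numpy as np

# Toy multi-task model: a shared layer learns patterns common to both
# tasks, while each task keeps its own output head.
rng = np.random.default_rng(0)

D_IN, D_SHARED = 8, 4
W_shared = rng.normal(size=(D_IN, D_SHARED))      # weights shared by both tasks
W_clinical = rng.normal(size=(D_SHARED, 1))       # head for clinical labels
W_conservation = rng.normal(size=(D_SHARED, 1))   # head for conservation labels

def forward(x, task):
    h = np.tanh(x @ W_shared)                     # shared representation
    head = W_clinical if task == "clinical" else W_conservation
    return 1.0 / (1.0 + np.exp(-(h @ head)))      # sigmoid pathogenicity score

# Both heads read the same shared representation of an input feature vector.
x = rng.normal(size=(1, D_IN))
print(forward(x, "clinical").shape, forward(x, "conservation").shape)
```

During training, gradients from both tasks would update `W_shared`, while each head is updated only by its own task's loss.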
Future plans include improving the model's ability to handle longer sequences, as well as its speed and performance.
The accepted manuscript, published by Oxford University Press, is available online.