Book: Scala Machine Learning Projects
Author: Md. Rezaul Karim

Machine learning for genetic variants

Research has shown that population groups from Asia, Europe, Africa, and America can be separated on the basis of their genomic data. However, it is more challenging to accurately predict the haplogroup and the continent of origin, that is, attributes tied to geography, ethnicity, and language. Other research shows that Y chromosome lineages can be geographically localized, which provides evidence for clustering human genotypes geographically by their alleles.

Thus, the clustering of individuals is correlated with geographic origin and ancestry. Since race depends on ancestry as well, the clusters are also correlated with the more traditional concepts of race. The correlation is not perfect, however, because genetic variation is distributed probabilistically: it does not fall into separate distributions for different races but rather overlaps across, or spills into, different populations.

Consequently, identifying ancestry, or even race, may prove useful for biomedical purposes, but a direct assessment of disease-related genetic variation will ultimately yield more accurate and beneficial information.

Various genomics projects, such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), the 1000 Genomes Project, and the Personal Genome Project (PGP), make large-scale datasets available. For fast processing of such data, ADAM and Spark-based solutions have been proposed and are now widely used in genomics data analytics research.

Spark is an efficient data-processing framework that also provides primitives for in-memory cluster computing, which helps, for example, when the same data is queried repeatedly. This makes Spark an excellent candidate for iterative machine learning algorithms, where it typically outperforms the Hadoop-based MapReduce framework. By using the genetic variants dataset from the 1000 Genomes Project, we will try to answer the following questions:

How is human genetic variation distributed geographically among different population groups?
Can we use the genomic profile of individuals to attribute them to specific populations or derive disease susceptibility from their nucleotide haplotype?
Is an individual's genomic data suitable for predicting geographic origin (that is, the population group of that individual)?

In this project, we address the preceding questions in a scalable and efficient way. In particular, we use Spark and ADAM for large-scale data processing, H2O for K-means clustering of the whole population to determine inter- and intra-population groups, and MLP-based supervised learning with tuned hyperparameters to predict more accurately the population group of an individual from that individual's genomic data. Do not worry at this point; we will provide the technical details on working with these technologies in a later section.
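As a preview of the clustering idea only, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Scala. It does not use H2O, Spark, or ADAM, so it is not the pipeline this project builds. The object name `KMeansSketch`, the six toy "individuals", and the encoding of each variant site as an alternate-allele count (0, 1, or 2) are assumptions made up for illustration. The initial centroids are chosen by hand so the result is deterministic.

```scala
object KMeansSketch {
  type Vec = Vector[Double]

  // Squared Euclidean distance between two equal-length vectors.
  def sqDist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to point p.
  def nearest(p: Vec, centroids: Vector[Vec]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Vector[Vec]): Vec =
    points.transpose.map(col => col.sum / col.size)

  // Lloyd's algorithm: alternate between assigning points to the nearest
  // centroid and recomputing centroids, until the assignments stop
  // changing or maxIter is reached. Returns one cluster label per point.
  def kMeans(points: Vector[Vec], init: Vector[Vec], maxIter: Int = 20): Vector[Int] = {
    var centroids = init
    var labels = points.map(nearest(_, centroids))
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      // Recompute each centroid; keep the old one if its cluster is empty.
      centroids = centroids.indices.toVector.map { i =>
        val members = points.zip(labels).collect { case (p, l) if l == i => p }
        if (members.isEmpty) centroids(i) else mean(members)
      }
      val newLabels = points.map(nearest(_, centroids))
      changed = newLabels != labels
      labels = newLabels
      iter += 1
    }
    labels
  }

  // Hypothetical toy data: 6 individuals x 4 variant sites,
  // each value an alternate-allele count (0, 1, or 2).
  val genotypes: Vector[Vec] = Vector(
    Vector(0.0, 0.0, 2.0, 2.0),
    Vector(0.0, 1.0, 2.0, 2.0),
    Vector(0.0, 0.0, 2.0, 1.0),
    Vector(2.0, 2.0, 0.0, 0.0),
    Vector(2.0, 1.0, 0.0, 0.0),
    Vector(2.0, 2.0, 1.0, 0.0)
  )

  def main(args: Array[String]): Unit = {
    // Seed the two clusters with the first and fourth individual.
    val labels = kMeans(genotypes, init = Vector(genotypes(0), genotypes(3)))
    println(labels.mkString(", ")) // prints "0, 0, 0, 1, 1, 1"
  }
}
```

In this toy input, the first three individuals share one allele pattern and the last three share another, so the two groups are recovered. Real genotype data has far more variant sites and individuals, which is why the project relies on distributed tooling instead of an in-memory loop like this one.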

However, before getting started, let's take a brief tour of the 1000 Genomes Project dataset to show why combining these technologies is so important.