Medulloblastoma Subgroups Prediction.
MBMethPred
MBMethPred is a user-friendly package developed for the accurate prediction of medulloblastoma subgroups using DNA methylation beta values. It incorporates seven machine learning models, including Random Forest, K-Nearest Neighbors, Support Vector Machine, Linear Discriminant Analysis, Extreme Gradient Boosting, Naive Bayes, and a neural network model specifically designed for the complexities of medulloblastoma data. The package provides streamlined workflows for data preprocessing, feature selection, model training, cross-validation, and prediction. This vignette offers detailed explanations, examples, and resulting outputs for each functionality. The MBMethPred package was tested on an Ubuntu machine equipped with an Intel Core i5-6200U processor and 16GB RAM.
Citation
Sharif Rahmani E, Lawarde A, Lingasamy P, Moreno SV, Salumets A and Modhukur V (2023), MBMethPred: a computational framework for the accurate classification of childhood medulloblastoma subgroups using data integration and AI-based approaches. Front. Genet. 14:1233657. doi: 10.3389/fgene.2023.1233657
Installation
install.packages("MBMethPred")
or
remotes::install_github("sharifrahmanie/MBMethPred")
require(MBMethPred)
Input file for prediction
The ReadMethylFile is a function for reading DNA methylation beta values files and using them as new data for prediction by every model. The input for this function should be either CSV or TSV file format.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
write.csv(NewData, "NewData.csv", quote = FALSE, row.names = FALSE)
methyl <- ReadMethylFile(File = "NewData.csv")
This function has only one argument, the File. The first column of the File is the CpG methylation probe that starts with cg characters and is followed by a number (e.g., cg100091). Other columns are samples with methylation beta values. All columns in the data frame should have a name.
Box plot
The BoxPlot function draws a box plot out of DNA methylation beta values or other data frames.
Usage
data <- Data2[1:20,]
data <- cbind(rownames(data), data)
colnames(data)[1] <- "ID"
BoxPlot(File = data, Projname = NULL)
This function has two arguments as follows:
FileA data frame with the first column as ID.ProjnameA string to name the plot.
t-SNE 3D plot
The TSNEPlot function draws a 3D t-SNE plot for the DNA methylation dataset using the K-means clustering technique. This function has two arguments File (any matrices) and NCluster ( number of clusters for K-Means clustering).
Usage
data <- data.frame(t(Data2[1:100,]))
data <- cbind(rownames(data), data)
colnames(data)[1] <- "ID"
TSNEPlot(File = data, NCluster = 4)
An R window will appear with a 3D projection of the t-SNE result. The plot object can be saved with the next line of code.
rgl.snapshot('tsne3d.png', fmt = 'png')
Input file for similarity network fusion (SNF)
Using ReadSNFData function, one can read files (any matrices with CSV or TSV format) and feed them into the similarity network fusion (SNF) function (from the SNFtools package).
Usage
data(Data2) # Gene expression
Data2 <- cbind(rownames(Data2), Data2)
colnames(Data2)[1] <- "ID"
write.csv(Data2, "Data2.csv", row.names = FALSE)
Data2 <- ReadSNFData(File = "Data2.csv")
Similarity network fusion (SNF)
The SimilarityNetworkFusion is a function to perform the SNF function (from the SNFtool package) and output clusters.
Usage
data(RLabels) # Real labels
data(Data2) # Methylation
data(Data3) # Gene expression
snf <- SimilarityNetworkFusion(Files = list(Data2, Data3),
NNeighbors = 13,
Sigma = 0.75,
NClusters = 4,
CLabels = c("Group4", "SHH", "WNT", "Group3"),
RLabels = RLabels,
Niterations = 60)
snf
This function has several arguments as follows:
FilesA list of data frames created using the ReadSNFData function.NNeighborsThe number of nearest neighbors.SigmaThe variance for local model.NClustersThe number of clusters.CLabelsA string vector to name the clusters. Optional.RLabelsThe actual label of samples to calculate the Normalized Mutual Information (NMI) score. Optional.NiterationsThe number of iterations for the diffusion process.
Support vector machine model
The SupportVectorMachineModel is a function to train a support vector machine model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). Prediction is followed by training if new data is provided.
Model metrics, including accuracy, precision, sensitivity F1-Score, specificity, and AUC_average can be calculated for the test dataset using the ModelMetrics function, which calculates the average of the above parameters from the result of the ConfusionMatrix function.
The prediction result on new data can be accessed through the NewDataPredictionResult function, which calculates every prediction's mode across the number of cross-validation folds.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
svm <- SupportVectorMachineModel(SplitRatio = 0.8,
CV = 10,
NCores = 1,
NewData = NewData)
ModelMetrics(Model = svm)
NewDataPredictionResult(Model = svm)
This function has the following arguments:
SplitRatioTrain and test split ratio. A value greater or equal to zero and less than one.CVThe number of folds for cross-validation. It should be greater than one.NCoresThe number of cores for parallel computing.NewDataA methylation beta values input from the ReadMethylFile function.
K nearest neighbor model
The KNearestNeighborModel is a function to train a K nearest neighbor model to classify medulloblastoma subgroups using DNA methylation beta values.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
knn <- KNearestNeighborModel(SplitRatio = 0.8,
CV = 10,
K = 3,
NCores = 1,
NewData = NewData)
ModelMetrics(Model = knn)
NewDataPredictionResult(Model = knn)
This function has the following arguments:
SplitRatioTrain and test split ratio. A value greater or equal to zero and less than one.CVThe number of folds for cross-validation. It should be greater than one.KThe number of nearest neighbors.NCoresThe number of cores for parallel computing.NewDataA methylation beta values input from the ReadMethylFile function.
Random forest model
The RandomForestModel is a function to train a random forest model to classify medulloblastoma subgroups using DNA methylation beta values.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
rf <- RandomForestModel(SplitRatio = 0.8,
CV = 10,
NTree = 100,
NCores = 1,
NewData = NewData)
ModelMetrics(Model = rf)
NewDataPredictionResult(Model = rf)
This function has the following arguments:
SplitRatioTrain and test split ratio. A value greater or equal to zero and less than one.CVThe number of folds for cross-validation. It should be greater than one.NTreeThe number of trees to be grown.NCoresThe number of cores for parallel computing.NewDataA methylation beta values input from the ReadMethylFile function.
XGBoost model
The XGBoostModel is a function to train an XGBoost model to classify medulloblastoma subgroups using DNA methylation beta values.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
xgboost <- XGBoostModel(SplitRatio = 0.8,
CV = 10,
NCores = 1,
NewData = NewData)
ModelMetrics(Model = xgboost)
NewDataPredictionResult(Model = xgboost)
This function has the following arguments:
SplitRatioTrain and test split ratio. A value greater or equal to zero and less than one.CVThe number of folds for cross-validation. It should be greater than one.NCoresThe number of cores for parallel computing.NewDataA methylation beta values input from the ReadMethylFile function.
Linear discriminant analysis model
The LinearDiscriminantAnalysisModel is a function to train a linear discriminant analysis model to classify medulloblastoma subgroups using DNA methylation beta values.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
lda <- LinearDiscriminantAnalysisModel(SplitRatio = 0.8,
CV = 10,
NCores = 1,
NewData = NewData)
ModelMetrics(Model = lda)
NewDataPredictionResult(Model = lda)
This function has the following arguments:
SplitRatioTrain and test split ratio. A value greater or equal to zero and less than one.CVThe number of folds for cross-validation. It should be greater than one.NCoresThe number of cores for parallel computing.NewDataA methylation beta values input from the ReadMethylFile function.
Naive Bayes model
The NaiveBayesModel is a function to train a Naive Bayes model to classify medulloblastoma subgroups using DNA methylation beta values.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
nb <- NaiveBayesModel(SplitRatio = 0.8,
CV = 10,
Threshold = 0.8,
NCores = 1,
NewData = NewData)
ModelMetrics(Model = nb)
NewDataPredictionResult(Model = nb)
This function has the following arguments:
SplitRatioTrain and test split ratio. A value greater or equal to zero and less than one.CVThe number of folds for cross-validation. It should be greater than one.ThresholdThe threshold for deciding class probability. A value greater or equal to zero and less than one.NCoresThe number of cores for parallel computing.NewDataA methylation beta values input from the ReadMethylFile function.
Artificial neural network model
The NeuralNetworkModel is a function to train an artificial neural network model to classify medulloblastoma subgroups using DNA methylation beta values. Please uncomment the following lines and run the function. If it is the first time you run this function, set the InstallTensorFlow parameter to TRUE. It will automatically install the Python and TensorFlow library (version 2.10-CPU) in a virtual environment and then set the parameter to FALSE.
Usage
set.seed(1234)
fac <- ncol(Data1)
NewData <- sample(data.frame(t(Data1[,-fac])),10)
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
ann <- NeuralNetworkModel(Epochs = 100,
NewData = NewData,
InstallTensorFlow = TRUE)
ModelMetrics(Model = ann)
NewDataPredictionResult(Model = ann)
This function has the following arguments:
EpochsThe number of epochs.NewDataA methylation beta values input from the ReadMethylFile function.InstallTensorFlowLogical. To run this function for the first time, you need to install the TensorFlow library (V 2.10-cpu). The default is TRUE.