Title: | Model Classifier for Binary Classification |
---|---|
Description: | A collection of tools that support data splitting, predictive modeling, and model evaluation. A typical use is to split a dataset into a training set and a test set and then compare the data distributions of the two. The package also supports developing several predictive models and comparing their performance, which helps in selecting the best model. |
Authors: | Choonghyun Ryu [aut, cre] |
Maintainer: | Choonghyun Ryu <[email protected]> |
License: | GPL-2 |
Version: | 0.3.9 |
Built: | 2025-01-21 04:42:37 UTC |
Source: | https://github.com/choonghyunryu/alookr |
cleanse() cleanses the dataset for classification modeling.
## S3 method for class 'data.frame'
cleanse(
  .data,
  uniq = TRUE,
  uniq_thres = 0.1,
  char = TRUE,
  missing = FALSE,
  verbose = TRUE,
  ...
)

cleanse(.data, ...)
.data | a data.frame or a |
uniq | logical. Whether to remove variables that have only one unique value. |
uniq_thres | numeric. Threshold for removing variables whose ratio of unique values (number of unique values / number of observations) is greater than this value. |
char | logical. Whether to convert character variables to factors. |
missing | logical. Whether to remove variables that contain missing values. |
verbose | logical. Whether to echo information to the console at runtime. |
... | further arguments passed to or from other methods. |
This function is useful when fitting a classification model. It does the following: it removes variables that have only one value; it removes character or categorical variables whose number of unique values is high relative to the number of observations (such variables typically correspond to identifiers); and it converts character variables to factors.
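The rules described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the variable names `uniq_ratio` and `to_drop`, and the exact comparison against the threshold are all hypothetical.

```r
# Toy data: an identifier-like column, a constant column, and a useful category.
dat <- data.frame(
  id   = paste0("k", 1:40),
  year = "2018",
  flag = rep(c("Y", "N"), 20),
  stringsAsFactors = FALSE
)

# Number of unique values, and its ratio to the number of observations.
n_unique   <- sapply(dat, function(x) length(unique(x)))
uniq_ratio <- n_unique / nrow(dat)

# Assumed rules for this sketch: drop single-valued variables and
# character variables whose unique-value ratio exceeds the threshold.
uniq_thres <- 0.1
is_char    <- sapply(dat, is.character)
to_drop    <- n_unique == 1 | (is_char & uniq_ratio > uniq_thres)

cleaned <- dat[, !to_drop, drop = FALSE]

# Convert the remaining character variables to factors.
cleaned[] <- lapply(cleaned, function(x) if (is.character(x)) factor(x) else x)

str(cleaned)
```

Here `id` is dropped as identifier-like, `year` is dropped as single-valued, and `flag` survives as a factor.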
An object of class data.frame or train_df. The return value has the same type as the .data argument.
# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

year <- "2018"

set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

dat <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)

# structure of dataset
str(dat)

# cleansing dataset
newDat <- cleanse(dat)

# structure of cleansing dataset
str(newDat)

# cleansing dataset
newDat <- cleanse(dat, uniq = FALSE)

# structure of cleansing dataset
str(newDat)

# cleansing dataset
newDat <- cleanse(dat, uniq_thres = 0.3)

# structure of cleansing dataset
str(newDat)

# cleansing dataset
newDat <- cleanse(dat, char = FALSE)

# structure of cleansing dataset
str(newDat)
Diagnoses the similarity between the train set and the test set contained in a "split_df" object, and cleanses the "split_df" object.
## S3 method for class 'split_df'
cleanse(.data, add_character = FALSE, uniq_thres = 0.9, missing = FALSE, ...)
.data | an object of class "split_df", usually the result of a call to split_by(). |
add_character | logical. Whether to include character variables in the comparison of categorical data. The default is FALSE, which excludes character variables. |
uniq_thres | numeric. Threshold for removing variables whose ratio of unique values (number of unique values / number of observations) is greater than this value. |
missing | logical. Whether to remove variables that contain missing values. |
... | further arguments passed to or from other methods. |
Removes the variables detected by the diagnosis performed with the compare_diag() function.
An object of class "split_df".
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

sb %>%
  cleanse
Diagnoses the similarity between the train set and the test set contained in a "split_df" object.
compare_diag(
  .data,
  add_character = FALSE,
  uniq_thres = 0.01,
  miss_msg = TRUE,
  verbose = TRUE
)
.data | an object of class "split_df", usually the result of a call to split_by(). |
add_character | logical. Whether to include character variables in the comparison of categorical data. The default is FALSE, which excludes character variables. |
uniq_thres | numeric. Threshold for flagging variables whose ratio of unique values (number of unique values / number of observations) is greater than this value. |
miss_msg | logical. Whether to output a message when diagnosing missing values. |
verbose | logical. Whether to echo information to the console at runtime. |
Across the two split datasets, the function diagnoses variables that have a single value, categorical variables that have a level missing from one of the datasets, and variables with a high ratio of unique values to observations.
list. Variables of tbl_df for first component named "single_value":
variables : character. variable name
train_uniq : character. the type of unique value in train set. it is divided into "single" and "multi".
test_uniq : character. the type of unique value in test set. it is divided into "single" and "multi".
Variables of tbl_df for second component named "uniq_rate":
variables : character. categorical variable name
train_uniqcount : numeric. the number of unique value in train set
train_uniqrate : numeric. the ratio of unique values(number of unique values / number of observation) in train set
test_uniqcount : numeric. the number of unique value in test set
test_uniqrate : numeric. the ratio of unique values(number of unique values / number of observation) in test set
Variables of tbl_df for third component named "missing_level":
variables : character. variable name
n_levels : integer. count of level of categorical variable
train_missing_nlevel : integer. the number of non-existent levels in the train set
test_missing_nlevel : integer. the number of non-existent levels in the test set
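The "missing_level" columns can be illustrated with a small base R sketch. The vectors `train_x` and `test_x` are hypothetical stand-ins for one categorical variable in each set; how the package computes these counts internally is not shown in this manual.

```r
# Hypothetical train/test factors that share one set of levels.
lv <- c("a", "b", "c", "d")
train_x <- factor(c("a", "b", "b", "c"), levels = lv)
test_x  <- factor(c("a", "d", "d"), levels = lv)

# Count of levels of the categorical variable.
n_levels <- nlevels(train_x)

# Number of levels that never occur in each set.
train_missing_nlevel <- sum(table(train_x) == 0)
test_missing_nlevel  <- sum(table(test_x) == 0)

data.frame(n_levels, train_missing_nlevel, test_missing_nlevel)
```

A level that is absent from the train set but present in the test set is the problematic case for most models, since the model never sees that level during fitting.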
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))

set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA

sb <- defaults %>%
  split_by(default)

sb %>%
  compare_diag()

sb %>%
  compare_diag(add_character = TRUE)

sb %>%
  compare_diag(uniq_thres = 0.0005)
compare_performance() compares the performance of a model with several model performance metrics.
compare_performance(model)
model | A model_df. The prediction results created by run_predict(). |
list. results of compared model performance. list has the following components:
recommend_model : character. The name of the model that is recommended as the best among the various models.
top_count : numeric. The number of best performing performance metrics by model.
mean_rank : numeric. Average of ranking individual performance metrics by model.
top_metric : list. The name of the performance metric with the best performance on individual performance metrics by model.
The performance metrics calculated are as follows:
ZeroOneLoss : Normalized Zero-One Loss(Classification Error Loss).
Accuracy : Accuracy.
Precision : Precision.
Recall : Recall.
Specificity : Specificity.
F1_Score : F1 Score.
LogLoss : Log loss / Cross-Entropy Loss.
AUC : Area Under the Receiver Operating Characteristic Curve (ROC AUC).
Gini : Gini Coefficient.
PRAUC : Area Under the Precision-Recall Curve (PR AUC).
LiftAUC : Area Under the Lift Chart.
GainAUC : Area Under the Gain Chart.
KS_Stat : Kolmogorov-Smirnov Statistic.
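The idea behind mean_rank and top_count can be sketched in base R. The metric values below are made up, and the ranking rule (rank within each metric, then average) is an assumption for illustration; this manual does not state how compare_performance() combines the metrics. Metrics where lower is better, such as ZeroOneLoss and LogLoss, would need their ranks reversed.

```r
# Hypothetical metric values (higher is better) for three models.
perf <- rbind(
  Accuracy = c(logistic = 0.80, rpart = 0.75, ranger = 0.85),
  AUC      = c(logistic = 0.88, rpart = 0.70, ranger = 0.86),
  F1_Score = c(logistic = 0.60, rpart = 0.55, ranger = 0.65)
)

# Rank the models within each metric (1 = best).
ranks <- t(apply(perf, 1, function(x) rank(-x)))

# Average rank per model, and how often each model ranks first.
mean_rank <- colMeans(ranks)
top_count <- colSums(ranks == 1)

# The model with the lowest average rank is the candidate recommendation.
recommended <- names(which.min(mean_rank))
recommended
```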
library(dplyr)

# Divide the train data set and the test data set.
sb <- rpart::kyphosis %>%
  split_by(Kyphosis)

# Extract the train data set from original data set.
train <- sb %>%
  extract_set(set = "train")

# Extract the test data set from original data set.
test <- sb %>%
  extract_set(set = "test")

# Sampling for unbalanced data set using SMOTE(synthetic minority over-sampling technique).
train <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# Cleaning the set.
train <- train %>%
  cleanse

# Run the model fitting.
result <- run_models(.data = train, target = "Kyphosis", positive = "present")

# Predict the model.
pred <- run_predict(result, test)

# Compare the model performance
compare_performance(pred)
Plots a comparison of the train set and the test set contained in a "split_df" object.
compare_plot(.data, ...)
.data | an object of class "split_df", usually the result of a call to split_by(). |
... | one or more unquoted expressions separated by commas. Select the variables you want to plot. You can treat variable names like they are positions. Positive values select variables; negative values drop variables. If the first expression is negative, compare_plot() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing. |
Numerical variables are drawn as density plots and categorical variables as mosaic plots, so that the distributions of the train set and the test set can be compared.
There is no return value. Only the plot is drawn.
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

sb %>%
  compare_plot("income")

sb %>%
  compare_plot()
Compare the statistics of the categorical variables of the train set and test set included in the "split_df" class.
compare_target_category(.data, ..., add_character = FALSE, margin = FALSE)
.data | an object of class "split_df", usually the result of a call to split_by(). |
... | one or more unquoted expressions separated by commas. Select the categorical variables you want to compare. You can treat variable names like they are positions. Positive values select variables; negative values drop variables. If the first expression is negative, compare_target_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing. |
add_character | logical. Whether to include character variables in the comparison of categorical data. The default is FALSE, which excludes character variables. |
margin | logical. Whether to calculate the marginal frequency information. |
Compare the statistics of the categorical variables of the train set and the test set to determine whether the raw data is well separated into two data sets.
tbl_df. Variables of tbl_df for comparison:
variable : character. categorical variable name
level : factor. level of categorical variables
train : numeric. the relative frequency of the level in the train set
test : numeric. the relative frequency of the level in the test set
abs_diff : numeric. the absolute value of the difference between two relative frequencies
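The comparison columns above can be reproduced for a single variable with base R. This is a sketch under assumptions: `train_x` and `test_x` are hypothetical vectors, and relative frequencies are expressed as proportions here, which may differ from the scale the package reports.

```r
# Hypothetical values of one categorical variable in each set.
lv <- c("No", "Yes")
train_x <- factor(c("No", "No", "No", "Yes"), levels = lv)
test_x  <- factor(c("No", "Yes"), levels = lv)

# Relative frequency of each level in the train and test sets.
cmp <- data.frame(
  level = lv,
  train = as.numeric(prop.table(table(train_x))),
  test  = as.numeric(prop.table(table(test_x)))
)

# Absolute difference between the two relative frequencies.
cmp$abs_diff <- abs(cmp$train - cmp$test)
cmp
```

Small abs_diff values across all levels suggest that the split preserved the distribution of the variable.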
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

sb %>%
  compare_target_category()

sb %>%
  compare_target_category(add_character = TRUE)

sb %>%
  compare_target_category(margin = TRUE)

sb %>%
  compare_target_category(student)

sb %>%
  compare_target_category(student, margin = TRUE)
Compare the statistics of the numerical variables of the train set and test set included in the "split_df" class.
compare_target_numeric(.data, ...)
.data | an object of class "split_df", usually the result of a call to split_by(). |
... | one or more unquoted expressions separated by commas. Select the numeric variables you want to compare. You can treat variable names like they are positions. Positive values select variables; negative values drop variables. If the first expression is negative, compare_target_numeric() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing. |
Compare the statistics of the numerical variables of the train set and the test set to determine whether the raw data is well separated into two data sets.
tbl_df. Variables for comparison:
variable : character. numeric variable name
train_mean : numeric. arithmetic mean of train set
test_mean : numeric. arithmetic mean of test set
train_sd : numeric. standard deviation of train set
test_sd : numeric. standard deviation of test set
train_z : numeric. the arithmetic mean of the train set divided by the standard deviation
test_z : numeric. the arithmetic mean of the test set divided by the standard deviation
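For one numeric variable, the columns listed above can be computed directly in base R. The vectors below are hypothetical; the sketch only follows the column definitions given in this section.

```r
# Hypothetical values of one numeric variable in each set.
train_x <- c(10, 12, 14, 16)
test_x  <- c(11, 13, 15)

cmp <- data.frame(
  train_mean = mean(train_x),
  test_mean  = mean(test_x),
  train_sd   = sd(train_x),
  test_sd    = sd(test_x)
)

# Mean divided by standard deviation, as described for train_z and test_z.
cmp$train_z <- cmp$train_mean / cmp$train_sd
cmp$test_z  <- cmp$test_mean / cmp$test_sd
cmp
```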
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

sb %>%
  compare_target_numeric()

sb %>%
  compare_target_numeric(balance)
Extract train set or test set from split_df class object
extract_set(x, set = c("train", "test"))
x | an object of class "split_df", usually the result of a call to split_by(). |
set | character. Specifies whether the extracted data is a train set or a test set. You can use "train" or "test". |
Extract the train or test sets based on the parameters you defined when creating split_df with split_by().
an object of class "tbl_df".
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

train <- sb %>%
  extract_set(set = "train")

test <- sb %>%
  extract_set(set = "test")
Computes the Matthews correlation coefficient from the actual and predicted values.
matthews(predicted, y, positive)
predicted | numeric. the predicted value of binary classification |
y | factor or character. the actual value of binary classification |
positive | level of positive class of binary classification |
The Matthews Correlation Coefficient has a value between -1 and 1, and the closer to 1, the better the performance of the binary classification.
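For reference, the coefficient is commonly defined from the confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal base R sketch of that formula follows; the helper name `mcc` and the choice to return NA when the denominator is zero are assumptions of this sketch, not the behavior of matthews().

```r
# Matthews correlation coefficient from predicted and actual class labels.
mcc <- function(predicted, y, positive) {
  tp <- sum(predicted == positive & y == positive)
  tn <- sum(predicted != positive & y != positive)
  fp <- sum(predicted == positive & y != positive)
  fn <- sum(predicted != positive & y == positive)

  # as.numeric() avoids integer overflow for large counts.
  denom <- sqrt(as.numeric(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) return(NA_real_)  # undefined when a margin is empty

  (tp * tn - fp * fn) / denom
}

actual <- c("Y", "Y", "N", "N", "N", "Y")
pred   <- c("Y", "N", "N", "N", "Y", "Y")

mcc(pred, actual, "Y")
```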
numeric. The Matthews Correlation Coefficient.
# simulate actual data
set.seed(123L)
actual <- sample(c("Y", "N"), size = 100, prob = c(0.3, 0.7), replace = TRUE)
actual

# simulate predict data
set.seed(123L)
pred <- sample(c("Y", "N"), size = 100, prob = c(0.2, 0.8), replace = TRUE)
pred

# simulate confusion matrix
table(pred, actual)

matthews(pred, actual, "Y")
Calculate some representative metrics for binary classification model evaluation.
performance_metric(
  pred,
  actual,
  positive,
  metric = c("ZeroOneLoss", "Accuracy", "Precision", "Recall", "Sensitivity",
    "Specificity", "F1_Score", "Fbeta_Score", "LogLoss", "AUC", "Gini", "PRAUC",
    "LiftAUC", "GainAUC", "KS_Stat", "ConfusionMatrix"),
  cutoff = 0.5,
  beta = 1
)
pred | numeric. Probability values that predict the positive class of the target variable. |
actual | factor. The value of the actual target variable. |
positive | character. Level of positive class of binary classification. |
metric | character. The performance metrics you want to calculate. See details. |
cutoff | numeric. Threshold for classifying predicted probability values into positive and negative classes. |
beta | numeric. Weight of precision in harmonic mean for F-Beta Score. |
The cutoff argument applies only if the metric argument is "ZeroOneLoss", "Accuracy", "Precision", "Recall", "Sensitivity", "Specificity", "F1_Score", "Fbeta_Score", "ConfusionMatrix".
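The effect of the cutoff on class-based metrics can be seen in a small base R sketch. The probabilities, the helper `classify`, and the use of `>=` at the threshold are assumptions for illustration; the manual does not say whether the package treats a probability equal to the cutoff as positive.

```r
# Hypothetical predicted probabilities of the positive class, with true labels.
prob   <- c(0.9, 0.6, 0.52, 0.4, 0.1)
actual <- c("present", "present", "absent", "absent", "absent")

# Turn probabilities into class labels at a given cutoff.
classify <- function(prob, cutoff, positive, negative) {
  ifelse(prob >= cutoff, positive, negative)
}

pred_050 <- classify(prob, 0.50, "present", "absent")
pred_055 <- classify(prob, 0.55, "present", "absent")

# Confusion matrix at cutoff = 0.55.
table(predicted = pred_055, actual = actual)

# Accuracy changes with the cutoff because the predicted classes change.
acc_050 <- mean(pred_050 == actual)
acc_055 <- mean(pred_055 == actual)
c(cutoff_0.50 = acc_050, cutoff_0.55 = acc_055)
```

Metrics computed from the probabilities themselves, such as AUC or LogLoss, do not depend on this step.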
numeric or table object. The confusion matrix is returned as a table object; every other metric is returned as a numeric value. The performance metrics calculated are as follows:
ZeroOneLoss : Normalized Zero-One Loss(Classification Error Loss).
Accuracy : Accuracy.
Precision : Precision.
Recall : Recall.
Sensitivity : Sensitivity.
Specificity : Specificity.
F1_Score : F1 Score.
Fbeta_Score : F-Beta Score.
LogLoss : Log loss / Cross-Entropy Loss.
AUC : Area Under the Receiver Operating Characteristic Curve (ROC AUC).
Gini : Gini Coefficient.
PRAUC : Area Under the Precision-Recall Curve (PR AUC).
LiftAUC : Area Under the Lift Chart.
GainAUC : Area Under the Gain Chart.
KS_Stat : Kolmogorov-Smirnov Statistic.
ConfusionMatrix : Confusion Matrix.
library(dplyr)

# Divide the train data set and the test data set.
sb <- rpart::kyphosis %>%
  split_by(Kyphosis)

# Extract the train data set from original data set.
train <- sb %>%
  extract_set(set = "train")

# Extract the test data set from original data set.
test <- sb %>%
  extract_set(set = "test")

# Sampling for unbalanced data set using SMOTE(synthetic minority over-sampling technique).
train <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# Cleaning the set.
train <- train %>%
  cleanse

# Run the model fitting.
result <- run_models(.data = train, target = "Kyphosis", positive = "present")
result

# Predict the model.
pred <- run_predict(result, test)
pred

# Calculate Accuracy.
performance_metric(attr(pred$predicted[[1]], "pred_prob"), test$Kyphosis,
  "present", "Accuracy")

# Calculate Confusion Matrix.
performance_metric(attr(pred$predicted[[1]], "pred_prob"), test$Kyphosis,
  "present", "ConfusionMatrix")

# Calculate Confusion Matrix by cutoff = 0.55.
performance_metric(attr(pred$predicted[[1]], "pred_prob"), test$Kyphosis,
  "present", "ConfusionMatrix", cutoff = 0.55)
plot_cutoff() draws a plot for choosing the cut-off that separates the positive class from the negative class among the predicted probabilities of a binary classification, and suggests a cut-off value.
plot_cutoff(
  predicted,
  y,
  positive,
  type = c("mcc", "density", "prob"),
  measure = c("mcc", "cross", "half")
)
predicted | numeric. the predicted value of binary classification |
y | factor or character. the actual value of binary classification |
positive | level of positive class of binary classification |
type | character. Visualization type. "mcc" draws the Matthews Correlation Coefficient scatter plot, "density" draws the density plot of the negative and positive classes, and "prob" draws a line or point plot of the predicted probabilities. |
measure | character. The kind of measure used to calculate the cut-off. "mcc" is the Matthews Correlation Coefficient, "cross" is the point where the positive and negative densities cross, and "half" is the midpoint of the probability scale, 0.5. |
If the type argument is "prob", a point plot is drawn when the number of observations is less than 100, and a line plot is drawn when it is greater than 100. In this case, the visualization can be slow.
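Choosing a cut-off with the "mcc" measure amounts to finding the threshold that maximizes the Matthews Correlation Coefficient. The base R sketch below shows that idea as a grid search; the helper `mcc_at`, the grid resolution, the `>=` rule, and the toy data are all assumptions, and plot_cutoff() may search differently.

```r
# MCC of the classification obtained by thresholding `prob` at `cutoff`.
mcc_at <- function(prob, y, positive, cutoff) {
  pred_pos <- prob >= cutoff
  act_pos  <- y == positive
  tp <- sum(pred_pos & act_pos)
  tn <- sum(!pred_pos & !act_pos)
  fp <- sum(pred_pos & !act_pos)
  fn <- sum(!pred_pos & act_pos)
  denom <- sqrt(as.numeric(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) 0 else (tp * tn - fp * fn) / denom
}

# Hypothetical predicted probabilities and actual classes.
prob <- c(0.052, 0.213, 0.347, 0.418, 0.706, 0.811, 0.903)
y    <- c("absent", "absent", "absent", "present", "absent", "present", "present")

# Evaluate the MCC on a grid of candidate cut-offs and keep the best one.
grid   <- seq(0.01, 0.99, by = 0.01)
scores <- sapply(grid, function(cut) mcc_at(prob, y, "present", cut))
best_cutoff <- grid[which.max(scores)]

best_cutoff
max(scores)
```

When several cut-offs tie for the best score, `which.max()` keeps the smallest one; a different tie-breaking rule would suggest a different value within the same plateau.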
numeric. cut-off value
library(ggplot2)
library(rpart)
data(kyphosis)

fit <- glm(Kyphosis ~., family = binomial, kyphosis)
pred <- predict(fit, type = "response")

cutoff <- plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "mcc")
cutoff

plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "mcc", measure = "cross")
plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "mcc", measure = "half")

plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "density", measure = "mcc")
plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "density", measure = "cross")
plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "density", measure = "half")

plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "prob", measure = "mcc")
plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "prob", measure = "cross")
plot_cutoff(pred, kyphosis$Kyphosis, "present", type = "prob", measure = "half")
plot_performance() plots the ROC curve of each model algorithm.
plot_performance(model)
model | A model_df. The prediction results created by run_predict(). |
The ROC curve is output for each model included in the model_df class object specified as a model argument.
There is no return value. Only the plot is drawn.
library(dplyr)

# Divide the train data set and the test data set.
sb <- rpart::kyphosis %>%
  split_by(Kyphosis)

# Extract the train data set from original data set.
train <- sb %>%
  extract_set(set = "train")

# Extract the test data set from original data set.
test <- sb %>%
  extract_set(set = "test")

# Sampling for unbalanced data set using SMOTE(synthetic minority over-sampling technique).
train <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# Cleaning the set.
train <- train %>%
  cleanse

# Run the model fitting.
result <- run_models(.data = train, target = "Kyphosis", positive = "present")

# Predict the model.
pred <- run_predict(result, test)

# Plot ROC curve
plot_performance(pred)
Fit some representative binary classification models.
run_models(
  .data,
  target,
  positive,
  models = c("logistic", "rpart", "ctree", "randomForest", "ranger", "xgboost",
    "lasso")
)
.data | A train_df. Train data to fit the model. It also supports tbl_df, tbl, and data.frame objects. |
target | character. Name of target variable. |
positive | character. Level of positive class of binary classification. |
models | character. Algorithm types of the models to fit. See details. The default value is c("logistic", "rpart", "ctree", "randomForest", "ranger", "xgboost", "lasso"). |
The supported models are fitted with functions from representative modeling packages in the R environment. The following binary classification models are supported:
"logistic" : logistic regression by glm() in stats package.
"rpart" : recursive partitioning tree model by rpart() in rpart package.
"ctree" : conditional inference tree model by ctree() in party package.
"randomForest" : random forest model by randomForest() in randomForest package.
"ranger" : random forest model by ranger() in ranger package.
"xgboost" : XGBoosting model by xgboost() in xgboost package.
"lasso" : lasso model by glmnet() in glmnet package.
run_models() fits the models in parallel. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
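As an illustration of what the "logistic" entry refers to, the sketch below fits a logistic regression with glm() from the stats package on the built-in mtcars data. The data set, the choice of `am` as a stand-in binary target, and the predictors are assumptions of this sketch; how run_models() calls glm() internally is not shown in this manual.

```r
# Built-in data; `am` (0/1) stands in for a binary target in this sketch.
dat <- mtcars[, c("am", "wt", "hp")]
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("auto", "manual"))

# With a factor response, binomial glm() models the probability of the
# second level ("manual"), which plays the role of the positive class here.
fit <- glm(am ~ wt + hp, data = dat, family = binomial)

# Predicted probability of the positive class for each observation.
prob_manual <- predict(fit, newdata = dat, type = "response")
head(prob_manual)
```

Probabilities of this kind are what the cutoff-based and curve-based metrics in this manual take as input.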
model_df. The results of the fitted models. model_df is built on tbl_df and contains the following variables:
step : character. The current stage in the model fit process. The result of calling run_models() is returned as "1.Fitted".
model_id : character. Type of fit model.
target : character. Name of target variable.
is_factor : logical. Indicates whether the target variable is a factor.
positive : character. Level of positive class of binary classification.
negative : character. Level of negative class of binary classification.
fitted_model : list. Fitted model object.
library(dplyr)

# Divide the train data set and the test data set.
sb <- rpart::kyphosis %>%
  split_by(Kyphosis)

# Extract the train data set from original data set.
train <- sb %>%
  extract_set(set = "train")

# Extract the test data set from original data set.
test <- sb %>%
  extract_set(set = "test")

# Sampling for unbalanced data set using SMOTE(synthetic minority over-sampling technique).
train <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# Cleaning the set.
train <- train %>%
  cleanse

# Run the model fitting.
result <- run_models(.data = train, target = "Kyphosis", positive = "present")
result

# Run the several kinds model fitting by dplyr
train %>%
  run_models(target = "Kyphosis", positive = "present")
Calculates performance metrics for binary classification model evaluation.
run_performance(model, actual = NULL)
model | A model_df. The prediction results created by run_predict(). |
actual | factor. The values of the target variable used to evaluate the model. A factor with two classes is supported. |
run_performance() calculates the performance metrics in parallel. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
model_df. The results of the predicted models. model_df is built on tbl_df and contains the following variables:
step : character. The current stage in the model fit process. The result of calling run_performance() is returned as "3.Performanced".
model_id : character. Type of fit model.
target : character. Name of target variable.
positive : character. Level of positive class of binary classification.
fitted_model : list. Fitted model object.
predicted : list. Predicted values for each model. Each element is a predict_class object.
performance : list. Metrics calculated for each model. Each element is a numeric vector.
The following performance metrics are calculated:
ZeroOneLoss : Normalized Zero-One Loss(Classification Error Loss).
Accuracy : Accuracy.
Precision : Precision.
Recall : Recall.
Sensitivity : Sensitivity.
Specificity : Specificity.
F1_Score : F1 Score.
Fbeta_Score : F-Beta Score.
LogLoss : Log loss / Cross-Entropy Loss.
AUC : Area Under the Receiver Operating Characteristic Curve (ROC AUC).
Gini : Gini Coefficient.
PRAUC : Area Under the Precision-Recall Curve (PR AUC).
LiftAUC : Area Under the Lift Chart.
GainAUC : Area Under the Gain Chart.
KS_Stat : Kolmogorov-Smirnov Statistic.
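To make a few of these metrics concrete, here is a minimal base R sketch that computes several of them by hand from toy data. It is not the package's implementation: the data, the variable names, and the rule that a probability strictly above the cut-off counts as positive are all assumptions made for illustration, and Gini is taken as the common 2 * AUC - 1 definition.

```r
# Toy data (hypothetical): observed classes and predicted probabilities
# of the positive class.
actual <- factor(c("present", "absent", "present", "absent", "absent",
                   "present", "absent", "absent"),
                 levels = c("absent", "present"))
prob <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.3, 0.1, 0.55)
positive <- "present"
cutoff <- 0.5

# Confusion-matrix counts (assumption: prob > cutoff means "positive").
pred_pos <- prob > cutoff
is_pos <- actual == positive
tp <- sum(pred_pos & is_pos)
fp <- sum(pred_pos & !is_pos)
fn <- sum(!pred_pos & is_pos)
tn <- sum(!pred_pos & !is_pos)

# Threshold-based metrics.
accuracy <- (tp + tn) / length(actual)
zero_one_loss <- 1 - accuracy
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)          # the same quantity as sensitivity
specificity <- tn / (tn + fp)
f1_score <- 2 * precision * recall / (precision + recall)

# Probability-based and rank-based metrics.
log_loss <- -mean(ifelse(is_pos, log(prob), log(1 - prob)))
r <- rank(prob)
n_pos <- sum(is_pos)
n_neg <- sum(!is_pos)
auc <- (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
gini <- 2 * auc - 1

round(c(Accuracy = accuracy, Precision = precision, Recall = recall,
        Specificity = specificity, F1_Score = f1_score, AUC = auc,
        Gini = gini), 4)
```

The AUC line uses the rank-sum (Mann-Whitney) formulation, which avoids building the ROC curve explicitly.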
    library(dplyr)

    # Divide the train data set and the test data set.
    sb <- rpart::kyphosis %>% split_by(Kyphosis)

    # Extract the train data set from the original data set.
    train <- sb %>% extract_set(set = "train")

    # Extract the test data set from the original data set.
    test <- sb %>% extract_set(set = "test")

    # Sample the unbalanced data set using SMOTE (synthetic minority over-sampling technique).
    train <- sb %>% sampling_target(seed = 1234L, method = "ubSMOTE")

    # Clean the set.
    train <- train %>% cleanse

    # Fit the models.
    result <- run_models(.data = train, target = "Kyphosis", positive = "present")
    result

    # Predict with the models. (Case 1)
    pred <- run_predict(result, test)
    pred

    # Calculate performance metrics. (Case 1)
    perf <- run_performance(pred)
    perf
    perf$performance

    # Predict with the models. (Case 2)
    pred <- run_predict(result, test[, -1])
    pred

    # Calculate performance metrics. (Case 2)
    perf <- run_performance(pred, pull(test[, 1]))
    perf
    perf$performance

    # Convert to a matrix to compare performance.
    sapply(perf$performance, "c")
Predict with the fitted representative binary classification models.
run_predict(model, .data, cutoff = 0.5)
Argument | Description |
---|---|
model | A model_df. The fitted models created by run_models(). |
.data | A tbl_df. The data set to predict on. tbl and data.frame objects are also supported. |
cutoff | numeric. The cut-off applied to the predicted probability of the positive class to decide whether an observation is classified as positive. |
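The effect of cutoff can be illustrated with a small base R sketch. The probabilities, the class labels, and the helper classify() below are hypothetical, and whether a probability exactly equal to the cut-off counts as positive is an assumption here, not something the documentation states.

```r
# Hypothetical predicted probabilities of the positive class.
prob <- c(0.12, 0.48, 0.52, 0.73, 0.91)
positive <- "present"
negative <- "absent"

# Illustrative helper (not part of the package): map probabilities to classes.
# Assumption: a probability strictly above the cut-off is labelled positive.
classify <- function(prob, cutoff = 0.5) {
  factor(ifelse(prob > cutoff, positive, negative),
         levels = c(negative, positive))
}

# The default cut-off labels three observations as positive.
table(classify(prob, cutoff = 0.5))

# Lowering the cut-off labels more observations as positive.
table(classify(prob, cutoff = 0.3))
```

Lowering the cut-off generally raises recall at the cost of precision, so it is worth tuning when the positive class is rare.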
Predictions are made with the prediction functions of representative modeling packages in R. The following binary classification models are supported:
"logistic" : logistic regression by predict.glm() in stats package.
"rpart" : recursive partitioning tree model by predict.rpart() in rpart package.
"ctree" : conditional inference tree model by predict() in stats package.
"randomForest" : random forest model by predict.randomForest() in randomForest package.
"ranger" : random forest model by predict.ranger() in ranger package.
"xgboost" : XGBoost (gradient boosting) model by predict.xgb.Booster() in xgboost package.
"lasso" : lasso regression model by predict.glmnet() in glmnet package.
run_predict() predicts with each model in parallel. However, parallel execution is not supported on the MS-Windows operating system or in the RStudio environment.
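Restrictions like this typically come from fork-based parallelism, which is unavailable on Windows and discouraged inside GUI sessions. The sketch below assumes a parallel::mclapply()-style approach, which may not be what the package does internally, and shows one way to fall back to a single core. The squared-number task is only a stand-in for per-model prediction.

```r
library(parallel)

# Assumption: forking is used only when it is available and considered safe.
# RSTUDIO = "1" is the environment variable set inside RStudio sessions.
use_fork <- .Platform$OS.type != "windows" && Sys.getenv("RSTUDIO") != "1"
cores <- if (use_fork) 2L else 1L

# Stand-in for "predict with each fitted model": square each element.
res <- mclapply(1:4, function(i) i^2, mc.cores = cores)
unlist(res)
```

With mc.cores = 1, mclapply() behaves like lapply(), so the same code path works on every platform.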
A model_df containing the results of the predicted models. A model_df is a tbl_df that contains the following variables:
step : character. The current stage in the model fit process. The result of calling run_predict() is returned as "2.Predicted".
model_id : character. Type of fit model.
target : character. Name of target variable.
is_factor : logical. Indicates whether the target variable is a factor.
positive : character. Level of positive class of binary classification.
negative : character. Level of negative class of binary classification.
fitted_model : list. Fitted model object.
predicted : list. Predicted values for each model. Each element is a predict_class object.
    library(dplyr)

    # Divide the train data set and the test data set.
    sb <- rpart::kyphosis %>% split_by(Kyphosis)

    # Extract the train data set from the original data set.
    train <- sb %>% extract_set(set = "train")

    # Extract the test data set from the original data set.
    test <- sb %>% extract_set(set = "test")

    # Sample the unbalanced data set using SMOTE (synthetic minority over-sampling technique).
    train <- sb %>% sampling_target(seed = 1234L, method = "ubSMOTE")

    # Clean the set.
    train <- train %>% cleanse

    # Fit the models.
    result <- run_models(.data = train, target = "Kyphosis", positive = "present")
    result

    # Predict with all fitted models in a dplyr pipeline.
    result %>% run_predict(test)
To address class imbalance, perform sampling on the train set of a split_df.
    sampling_target(
      .data,
      method = c("ubUnder", "ubOver", "ubSMOTE"),
      seed = NULL,
      perc = 50,
      k = ifelse(method == "ubSMOTE", 5, 0),
      perc.over = 200,
      perc.under = 200
    )
Argument | Description |
---|---|
.data | An object of class "split_df", usually the result of a call to split_by(). |
method | character. Sampling method. "ubUnder" is under-sampling, "ubOver" is over-sampling, and "ubSMOTE" is SMOTE (Synthetic Minority Over-sampling TEchnique). |
seed | integer. Random seed used for sampling. |
perc | integer. The percentage of the positive class in the final dataset. Used only in under-sampling. The default is 50, and perc cannot exceed 50. |
k | integer. Used only in over-sampling and SMOTE. With over-sampling and k = 0, sample with replacement from the minority class until both classes have the same number of instances. With over-sampling and k > 0, sample with replacement from the minority class until there are k times the original number of minority instances. With SMOTE, k is the number of neighbours considered as the pool from which the new examples are generated. |
perc.over | integer. Used only in SMOTE. perc.over/100 is the number of new instances generated for each rare instance. If perc.over < 100, a single instance is generated. |
perc.under | integer. Used only in SMOTE. perc.under/100 is the number of "normal" (majority class) instances that are randomly selected for each smoted observation. |
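The arithmetic implied by perc.over and perc.under can be worked through by hand. The sketch below uses a hypothetical minority count of 50 and the default values of 200. It only reproduces the counts described in the argument table and does not generate any synthetic observations.

```r
n_minority <- 50   # hypothetical number of rare (minority) instances
perc.over  <- 200  # default value
perc.under <- 200  # default value

# perc.over / 100 new instances are generated for each rare instance.
n_synthetic <- n_minority * perc.over / 100

# perc.under / 100 majority instances are selected for each smoted observation.
n_majority_kept <- n_synthetic * perc.under / 100

# Minority class after SMOTE: original plus synthetic instances.
n_minority_total <- n_minority + n_synthetic

c(synthetic = n_synthetic,
  minority = n_minority_total,
  majority = n_majority_kept)
```

Under these assumptions the result holds 150 minority and 200 majority instances, so raising perc.under shifts the balance back toward the majority class.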
To address the class imbalance problem, sampling is performed with one of three methods: under-sampling, over-sampling, or SMOTE.
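Random under-sampling and over-sampling can be sketched in a few lines of base R. This is a simplified illustration on a hypothetical 90/10 target vector, not the routine the package calls, and the way perc is turned into a majority sample size is an assumption.

```r
set.seed(1234L)

# Hypothetical imbalanced target: 90 majority and 10 minority observations.
y <- factor(c(rep("No", 90), rep("Yes", 10)))
minority_idx <- which(y == "Yes")
majority_idx <- which(y == "No")

# Under-sampling: keep every minority row and sample the majority so that
# the minority makes up `perc` percent of the result.
perc <- 50
n_major <- round(length(minority_idx) * (100 - perc) / perc)
under_idx <- c(minority_idx, sample(majority_idx, n_major))
table(y[under_idx])

# Over-sampling (the k = 0 case): resample the minority with replacement
# until both classes have the same number of instances.
over_idx <- c(majority_idx,
              sample(minority_idx, length(majority_idx), replace = TRUE))
table(y[over_idx])
```

Under-sampling discards information from the majority class, while over-sampling duplicates minority rows, which is why SMOTE's synthetic instances are often preferred for small minority classes.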
An object of class train_df.
The attributes of the train_df class are as follows:
sample_seed : integer. random seed used for sampling
method : character. sampling methods.
perc : integer. perc argument value
k : integer. k argument value
perc.over : integer. perc.over argument value
perc.under : integer. perc.under argument value
binary : logical. whether the target variable is a binary class
target : character. target variable name
minority : character. the level of the minority class
majority : character. the level of the majority class
    library(dplyr)

    # Credit Card Default Data
    head(ISLR::Default)

    # Generate data for the example
    sb <- ISLR::Default %>% split_by(default)

    # under-sampling with random seed
    under <- sb %>% sampling_target(seed = 1234L)
    under %>% count(default)

    # under-sampling with random seed, and minority class frequency is 40%
    under40 <- sb %>% sampling_target(seed = 1234L, perc = 40)
    under40 %>% count(default)

    # over-sampling with random seed
    over <- sb %>% sampling_target(method = "ubOver", seed = 1234L)
    over %>% count(default)

    # over-sampling with random seed, and k = 10
    over10 <- sb %>% sampling_target(method = "ubOver", seed = 1234L, k = 10)
    over10 %>% count(default)

    # SMOTE with random seed
    smote <- sb %>% sampling_target(method = "ubSMOTE", seed = 1234L)
    smote %>% count(default)

    # SMOTE with random seed, and perc.under = 250
    smote250 <- sb %>% sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)
    smote250 %>% count(default)
The split_by() splits the data.frame or tbl_df into a train set and a test set.
    split_by(.data, ...)

    ## S3 method for class 'data.frame'
    split_by(.data, target, ratio = 0.7, seed = NULL, ...)
.data |
a data.frame or a |
... |
further arguments passed to or from other methods. |
target |
unquoted expression or variable name. The name of the target variable. |
ratio |
numeric. The proportion of observations assigned to the train dataset. The default is 0.7. |
seed |
integer. Random seed used for splitting. |
split_by() creates a split_df object, which contains the split information and the criteria used to separate the training set from the test set.
An object of class split_df.
The attributes of the split_df class are as follows:
split_seed : integer. random seed used for splitting
target : character. the name of the target variable
binary : logical. whether the target variable is binary class
minority : character. the name of the minority class
majority : character. the name of the majority class
minority_rate : numeric. the rate of the minority class
majority_rate : numeric. the rate of the majority class
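The following base R sketch shows how a split flag and attributes of this kind could be derived. It assumes a simple random split, which may differ from the package's actual sampling scheme. The helper split_sketch(), its split_flag element, and the toy data frame are hypothetical.

```r
# Illustrative helper (not part of the package): random train/test split
# plus summary information about the target variable.
split_sketch <- function(data, target, ratio = 0.7, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(data)

  # Assumption: a simple random sample of round(n * ratio) rows forms the train set.
  train_idx <- sample(seq_len(n), size = round(n * ratio))
  flag <- ifelse(seq_len(n) %in% train_idx, "train", "test")

  # Class frequencies of the target, sorted from rarest to most common.
  freq <- sort(table(data[[target]]))
  list(
    split_flag = flag,
    split_seed = seed,
    target = target,
    binary = length(freq) == 2,
    minority = names(freq)[1],
    majority = names(freq)[length(freq)],
    minority_rate = as.integer(freq[1]) / n,
    majority_rate = as.integer(freq[length(freq)]) / n
  )
}

# Toy data: 80 "No" and 20 "Yes" observations.
df <- data.frame(y = factor(rep(c("No", "Yes"), times = c(80, 20))), x = 1:100)
info <- split_sketch(df, "y", ratio = 0.7, seed = 1234L)

table(info$split_flag)
info[c("binary", "minority", "majority", "minority_rate", "majority_rate")]
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same test set.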
    library(dplyr)

    # Credit Card Default Data
    head(ISLR::Default)

    # Generate data for the example
    sb <- ISLR::Default %>% split_by(default)
    sb
Summary method for the "split_df" class.
    ## S3 method for class 'split_df'
    summary(object, ...)
Argument | Description |
---|---|
object | An object of class "split_df", usually the result of a call to split_by(). |
... | Further arguments passed to or from other methods. |
summary.split_df reports the sizes of the two split data sets, along with minority class and majority class information.
NULL is returned, but information about the split train set and test set is printed. The output contains the following:
Random seed
Number of train sets and test sets
Name of target variable
Target variable minority class and majority class information (label and ratio)
    library(dplyr)

    # Credit Card Default Data
    head(ISLR::Default)

    # Generate data for the example
    sb <- ISLR::Default %>% split_by(default)
    sb

    summary(sb)
treatment_corr() diagnoses pairs of highly correlated variables and, optionally, removes one variable from each pair.
treatment_corr(.data, corr_thres = 0.8, treat = TRUE, verbose = TRUE)
.data |
a data.frame or a |
corr_thres |
numeric. Threshold for detecting variables. Pairs whose correlation is greater than the threshold are flagged. |
treat |
logical. Set whether to remove the flagged variables. |
verbose |
logical. Set whether to echo information to the console at runtime. |
The Pearson correlation coefficient is computed for continuous variables, and the Spearman correlation coefficient for categorical variables.
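A minimal base R sketch of this kind of diagnosis follows. The helper find_corr_pairs() is hypothetical, and computing Spearman correlation on the integer codes of factors is an assumption about how categorical variables could be handled, not a description of the package internals.

```r
# Illustrative helper (not part of the package): list variable pairs whose
# absolute correlation exceeds corr_thres.
find_corr_pairs <- function(data, corr_thres = 0.8) {
  is_num <- vapply(data, is.numeric, logical(1))
  is_cat <- vapply(data, is.factor, logical(1))

  # Report each pair once by looking only at the upper triangle.
  pairs_over <- function(m) {
    idx <- which(abs(m) > corr_thres & upper.tri(m), arr.ind = TRUE)
    data.frame(var1 = rownames(m)[idx[, 1]],
               var2 = colnames(m)[idx[, 2]],
               corr = m[idx],
               stringsAsFactors = FALSE)
  }

  out <- list()
  if (sum(is_num) > 1) {
    out$numeric <- pairs_over(cor(data[is_num], method = "pearson"))
  }
  if (sum(is_cat) > 1) {
    # Assumption: factors are compared through their integer codes.
    codes <- as.data.frame(lapply(data[is_cat], as.integer))
    out$categorical <- pairs_over(cor(codes, method = "spearman"))
  }
  out
}

# Deterministic toy data.
x1 <- 1:100
x2 <- 2 * x1 + rep(c(-1, 1), 50)     # nearly collinear with x1
x3 <- rep(c(1, 5, 2, 4, 3), 20)      # repeating pattern, weakly related to x1
f1 <- factor(rep(letters[1:5], 20))
f2 <- factor(rep(LETTERS[1:5], 20))  # same level ordering as f1
exam <- data.frame(x1, x2, x3, f1, f2)

res <- find_corr_pairs(exam, corr_thres = 0.8)
res$numeric
res$categorical
```

Once a pair is flagged, dropping either member removes the redundancy. Which one to keep is a modeling decision that this sketch leaves open.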
A data.frame or train_df object. The return value has the same type as the .data argument, although some variables may have been removed because of their correlation with other variables.
    # numerical variables
    x1 <- 1:100
    set.seed(12L)
    x2 <- sample(1:3, size = 100, replace = TRUE) * x1 + rnorm(1)
    set.seed(1234L)
    x3 <- sample(1:2, size = 100, replace = TRUE) * x1 + rnorm(1)

    # categorical variables
    x4 <- factor(rep(letters[1:20], times = 5))
    set.seed(100L)
    x5 <- factor(rep(letters[1:20 + sample(1:6, size = 20, replace = TRUE)], times = 5))
    set.seed(200L)
    x6 <- factor(rep(letters[1:20 + sample(1:3, size = 20, replace = TRUE)], times = 5))
    set.seed(300L)
    x7 <- factor(sample(letters[1:5], size = 100, replace = TRUE))

    exam <- data.frame(x1, x2, x3, x4, x5, x6, x7)
    str(exam)
    head(exam)

    # default case
    treatment_corr(exam)

    # do not remove variables
    treatment_corr(exam, treat = FALSE)

    # detect variables whose correlation is greater than 0.9
    treatment_corr(exam, corr_thres = 0.9, treat = FALSE)

    # non-verbose mode
    treatment_corr(exam, verbose = FALSE)