--- title: "Introduce dlookr" author: "Choonghyun Ryu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduce dlookr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r environment, echo = FALSE, message = FALSE, warning=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "", out.width = "600px", dpi = 70) options(tibble.print_min = 4L, tibble.print_max = 4L) library(dlookr) library(dplyr) library(ggplot2) ``` ## Preface After you have acquired the data, you should do the following: * Diagnose data quality. + If there is a problem with data quality, + The data must be corrected or re-acquired. * Explore data to understand the data and find scenarios for performing the analysis. * Derive new variables or perform variable transformations. The dlookr package makes these steps fast and easy: * Performs a data diagnosis or automatically generates a data diagnosis report. * Discover data in various ways and automatically generate EDA(exploratory data analysis) reports. * Impute missing values and outliers, resolve skewed data, and binaries continuous variables into categorical variables. And generates an automated report to support it. dlookr increases synergy with `dplyr`. Particularly in data exploration and data wrangling, it increases the efficiency of the `tidyverse` package group. ## Supported data structures Data diagnosis supports the following data structures. * data frame: data.frame class. * data table: tbl_df class. * table of DBMS: table of the DBMS through tbl_dbi + Use dplyr as the back-end interface for any DBI-compatible database. ## List of supported tasks of data analytics ### Diagnose Data #### Overall Diagnose Data Tasks | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: describe overview of data | Inquire basic information to understand the data in general | `overview()` | summary overview object | summary described overview of data | `summary.overview()` | plot overview object | plot described overview of data | `plot.overview()` | diagnose data quality of variables | The scope of data quality diagnosis is information on missing values and unique value information | `diagnose()` | x diagnose data quality of categorical variables | frequency, ratio, rank by levels of each variables | `diagnose_category()` | x diagnose data quality of numerical variables | descriptive statistics, number of zero, minus, outliers | `diagnose_numeric()` | x diagnose data quality for outlier | number of outliers, ratio, mean of outliers, mean with outliers, mean without outliers | `diagnose_outlier()` | x plot outliers information of numerical data | box plot and histogram whith outliers, without outliers | `plot_outlier.data.frame()` | x plot outliers information of numerical data by target variable | box plot and density plot whith outliers, without outliers | `plot_outlier.target_df()` | x diagnose combination of categorical variables | Check for sparse cases of level combinations of categorical variables | `diagnose_sparese()` | #### Visualize Missing Values Tasks | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: pareto chart for missing value | visualize the Pareto chart for variables with a missing value. | `plot_na_pareto()` | combination chart for missing value | visualize the distribution of missing value by combining variables. | `plot_na_hclust()` | plot the combination variables that is include missing value | visualize the combinations of missing value across cases | `plot_na_intersect()` | #### Reporting Types | Descriptions | Functions | Support DBI :-----|:-------|:---|:---: report the information of data diagnosis into a PDF file | report the information for diagnosing the data quality | `diagnose_report()` | x reporting the information of data diagnosis into HTML file | report the information for diagnosing the quality of the data | `diagnose_report()` | x reporting the information of data diagnosis into HTML file | dynamic report the information for diagnosing the quality of the data | `diagnose_web_report()` | x reporting the information of data diagnosis into PDF and HTML files | paged report the information for diagnosing the quality of the data | `diagnose_paged_report()` | x ### EDA #### Univariate EDA Types | Tasks | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: categorical | summaries | frequency tables | `univar_category()` | categorical | summaries | chi-squared test | `summary.univar_category()` | categorical | visualize | bar charts | `plot.univar_category()` | categorical | visualize | bar charts | `plot_bar_category()` | numerical | summaries | descriptive statistics | `describe()` | x numerical | summaries | descriptive statistics | `univar_numeric()` | numerical | summaries | descriptive statistics of standardized variable | `summary.univar_numeric()` | numerical | visualize | histogram, box plot | `plot.univar_numeric()` | numerical | visualize | Q-Q plots | `plot_qq_numeric()` | numerical | visualize | box plot | `plot_box_numeric()` | numerical | visualize | histogram | `plot_hist_numeric()` | #### Bivariate EDA Types | Tasks | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: categorical | summaries | frequency tables cross cases | `compare_category()` | categorical | summaries | contingency tables, chi-squared test | `summary.compare_category()` | categorical | visualize | mosaics plot | `plot.compare_category()` | numerical | summaries | correlation coefficient, linear model summaries | `compare_numeric()` | numerical | summaries | correlation coefficient, linear model summaries with threshold | `summary.compare_numeric()` | numerical | visualize | scatter plot with marginal box plot | `plot.compare_numeric()` | numerical | Correlate | correlation coefficient | `correlate()` | x numerical | Correlate | summaries with correlation matrix | `summary.correlate()` | x numerical | Correlate | visualization of a correlation matrix | `plot.correlate()` | x both | PPS | PPS(Predictive Power Score) | `pps()` | x both | PPS | summaries with PPS | `summary.pps()` | x both | PPS | visualization of a PPS matrix | `plot.pps()` | x #### Normality Test Types | Tasks | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: numerical | summaries | Shapiro-Wilk normality test | `normality()` | x numerical | summaries | normality diagnosis plot (histogram, Q-Q plots) | `plot_normality()` | x #### Relationship between target variable and predictors Target Variable | Predictor | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: categorical | categorical | contingency tables | `relate()` | x categorical | categorical | mosaics plot | `plot.relate()` | x categorical | numerical | descriptive statistic for each levels and total observation | `relate()` | x categorical | numerical | density plot | `plot.relate()` | x categorical | categorical | bar charts | `plot_bar_category()` | numerical | categorical | ANOVA test | `relate()` | x numerical | categorical | scatter plot | `plot.relate()` | x numerical | numerical | simple linear model | `relate()` | x numerical | numerical | box plot | `plot.relate()` | x categorical | numerical | Q-Q plots | `plot_qq_numeric()` | categorical | numerical | box plot | `plot_box_numeric()` | categorical | numerical | histogram | `plot_hist_numeric()` | #### Reporting Types | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: reporting the information of EDA into PDF file | reporting the information of EDA | `eda_report()` | x reporting the information of EDA into HTML file | reporting the information of EDA | `eda_report()` | x reporting the information of EDA into PDF file | dynamic reporting the information of EDA | `eda_web_report()` | x reporting the information of EDA into HTML file | paged reporting the information of EDA | `eda_paged_report()` | x ### Transform Data #### Find Variables Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: missing values | find the variable that contains the missing value in the object that inherits the data.frame | `find_na()` | outliers | find the numerical variable that contains outliers in the object that inherits the data.frame | `find_outliers()` | skewed variable | find the numerical variable that is the skewed variable that inherits the data.frame | `find_skewness()` | #### Imputation Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: missing values | missing values are imputed with some representative values and statistical methods. | `imputate_na()` | outliers | outliers are imputed with some representative values and statistical methods. | `imputate_outlier()` | summaries | calculate descriptive statistics of the original and imputed values. | `summary.imputation()` | visualize | the imputation of a numerical variable is a density plot, and the imputation of a categorical variable is a bar plot. | `plot.imputation()` | #### Binning Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: binning | converts a numeric variable to a categorization variable | `binning()` | summaries | calculate frequency and relative frequency for each levels(bins) | `summary.bins()` | visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | `plot.bins()` | optimal binning | categorizes a numeric characteristic into bins for ulterior usage in scoring modeling | `binning_by()` | summaries | summary metrics to evaluate the performance of binomial classification model | `summary.optimal_bins()` | visualize | generates plots for understand distribution, bad rate, and weight of evidence after running binning_by() | `plot.optimal_bins()` | infogain binning | categorizes a numeric characteristic into bins for multi-class variables using recursive information gain ratio maximization | `binning_rgr()` | visualize | generates plots for understanding distribution and distribution by target variable after running binning_rgr() | `plot.infogain_bins()` | evaluate | calculates metrics to evaluate the performance of binned variable for binomial classification model | `performance_bin()` | summaries | summary metrics to evaluate the performance of binomial classification model after performance_bin() | `summary.performance_bin()` | visualize | It generates plots to understand frequency, WoE by bins using performance_bin after running binning_by() | `plot.performance_bin()` | visualize | extract bins from "bins" and "optimal_bins" objects | `extract.bins()` | #### Diagnose Binned Variable Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: diagnosis | performs diagnose performance that calculates metrics to evaluate the performance of binned variable for binomial classification model | `performance_bin()` | summaries | summary method for "performance_bin". summary metrics to evaluate the performance of the binomial classification model | ` summary.performance_bin()` | visualize | visualize for understanding frequency, WoE by bins using performance_bin and something else | `plot.performance_bin()` | #### Transformation Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: transformation | performs variable transformation for standardization and resolving skewness of numerical variables | `transform()` | summaries | compares the distribution of data before and after data transformation | `summary.transform()` | visualize | visualize two kinds of a plot by attribute of the 'transform' class. The transformation of a numerical variable is a density plot | `plot.transform()` | #### Reporting Types | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: reporting the information of transformation into PDF | reporting the information of transformation | `transformation_report()` | reporting the information of transformation into HTML | reporting the information of transformation | `transformation_report()` | reporting the transformation information into PDF | dynamic reporting the transformation information | `transformation_web_report()` | reporting the information of transformation into HTML | paged reporting the information of transformation | `transformation_paged_report()` | ### Miscellaneous #### Statistics Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: statistics | calculate the entropy | `entropy()` | statistics | calculate the skewness of the data | `skewness()` | statistics | calculate the kurtosis of the data | `kurtosis()` | statistics | calculate the Jensen-Shannon divergence between two probability distributions | `jsd()` | statistics | calculate the Kullback-Leibler divergence between two probability distributions | `kld()` | statistics | calculate the Cramer's V statistic between two categorical(discrete) variables | `cramer()` | statistics | calculate the Theil's U statistic between two categorical(discrete) variables | `theil()` | statistics | finding percentile of a numerical variable. | `get_percentile()` | statistics | transform a numeric vector using several methods like "log", "sqrt", "log+1", "log+a", "1/x", "x^2", "x^3", "Box-Cox", "Yeo-Johnson"| `get_transform()` | statistics | calculate the Cramer's V statistic | `cramer()` | statistics | calculate the Theil's U statistic | `theil()` | #### Programming Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: programming | extracts variable information having a certain class from an object inheriting data.frame | `find_class()` | programming | gets class of variables in data.frame or tbl_df | `get_class()` | programming | retrieves the column information of the DBMS table through the tbl_bdi object of dplyr | `get_column_info()` | programming | finding the user machine's OS. | `get_os()` | programming | import Google fonts | `import_google_font()` |