API
A data cleaning Python tool.

dataclean.autoclean(Xy, dataset_name, features)
Autocleans data.
The following steps are performed automatically: show important features; show statistical information; discover the data type for each feature; identify the duplicated rows; unify the inconsistent column names; handle missing values; handle outliers.
Parameters: Xy : arraylike
Complete data.
dataset_name : string
features : list
List of feature names.
Returns: Xy_cleaned : arraylike
Cleaned data.

dataclean.build_forest(X, y)
Build a random forest model from the dataset and compute the important features.
Parameters: X : arraylike, shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : arraylike, shape (n_samples,)
Target values (class labels in classification, real numbers in regression).
Returns: importances : array, shape = [n_features]
The feature importances (the higher, the more important the feature).
indices : array, shape = [n_features]
Feature indices ordered by decreasing importance.
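The relationship between the two return values can be illustrated with scikit-learn. This is a minimal sketch, not the actual dataclean implementation: the helper name `rank_features_by_importance`, the choice of `RandomForestClassifier`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rank_features_by_importance(X, y, n_estimators=100, random_state=0):
    """Hypothetical helper: fit a forest and rank the features."""
    forest = RandomForestClassifier(n_estimators=n_estimators,
                                    random_state=random_state)
    forest.fit(X, y)
    importances = forest.feature_importances_
    # argsort is ascending, so reverse it to put the most important first.
    indices = np.argsort(importances)[::-1]
    return importances, indices


# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
importances, indices = rank_features_by_importance(X, y)
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. feature {idx}: {importances[idx]:.3f}")
```

For a regression target, `RandomForestRegressor` exposes the same `feature_importances_` attribute.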

dataclean.clean_duplicated_rows(Xy)
Clean duplicated rows.
Parameters: Xy : arraylike
Complete numpy array (target required) of the dataset.
Returns: Xy : arraylike
Original data.
Xy_no_duplicate : arraylike
Cleaned data without duplicated rows if user wants to drop the duplicated rows.
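Detecting and dropping duplicated rows in an array can be sketched with pandas. The helper `split_duplicates` below is hypothetical and skips the interactive prompt; it only shows one plausible way to obtain the de-duplicated array.

```python
import numpy as np
import pandas as pd


def split_duplicates(Xy):
    """Hypothetical helper: return de-duplicated rows and the number dropped."""
    df = pd.DataFrame(Xy)
    # Mark every repeat of a row after its first occurrence.
    duplicate_mask = df.duplicated(keep="first")
    return df.loc[~duplicate_mask].to_numpy(), int(duplicate_mask.sum())


Xy = np.array([[1, 2, 0],
               [1, 2, 0],
               [3, 4, 1],
               [1, 2, 0]])
Xy_no_duplicate, n_dropped = split_duplicates(Xy)
print(Xy_no_duplicate)
print("rows dropped:", n_dropped)
```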

dataclean.clean_missing(df, features)
Clean missing values in the dataset.
Parameters: df : DataFrame
features : List
List of feature names.
Returns: features_new : List
List of feature names after cleaning.
Xy_filled : arraylike
Numpy array where missing values have been cleaned.

dataclean.compute_clustering_metafeatures(X)
Computes clustering meta features.
The following 3 clustering meta features are adopted: Silhouette Coefficient; Calinski-Harabasz Index; Davies-Bouldin Index.
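All three indices are available in scikit-learn and need cluster labels as input. The sketch below assumes k-means with two clusters to produce those labels; the clustering algorithm dataclean uses is not stated here, and the helper name `clustering_metafeatures` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def clustering_metafeatures(X, n_clusters=2, random_state=0):
    """Hypothetical helper: cluster X, then score the resulting partition."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }


# Two well-separated synthetic blobs, used only to make the sketch runnable.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)
meta = clustering_metafeatures(X)
print(meta)
```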

dataclean.compute_imputation_score(Xy)
Computes the score of the imputation by applying simple classifiers.
The following simple learners are evaluated: Naive Bayes Learner; Linear Discriminant Learner; One Nearest Neighbor Learner; Decision Node Learner.
Parameters: Xy : arraylike
Complete numpy array of the dataset. The training array X must already be imputed, and the target y is required in order to score the imputation method.
Returns: imputation_score : float
Predicted score of the imputation method.

dataclean.compute_metafeatures(X, y)
Computes landmarking meta features.
The following landmarking features are computed: Naive Bayes Learner; Linear Discriminant Learner; One Nearest Neighbor Learner; Decision Node Learner; Randomly Chosen Node Learner.
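Landmarking describes a dataset by how well a few cheap learners perform on it. The following is a rough sketch using scikit-learn estimators as stand-ins for the learners listed above. The mapping is an assumption: a depth-1 decision tree plays the decision node, and a depth-1 tree restricted to one randomly drawn feature plays the randomly chosen node. The helper name `landmarking_metafeatures` is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def landmarking_metafeatures(X, y, cv=5):
    """Hypothetical helper: mean cross-validated accuracy of each landmarker."""
    learners = {
        "naive_bayes": GaussianNB(),
        "linear_discriminant": LinearDiscriminantAnalysis(),
        "one_nearest_neighbor": KNeighborsClassifier(n_neighbors=1),
        # A single split on the best feature.
        "decision_node": DecisionTreeClassifier(max_depth=1, random_state=0),
        # A single split on one randomly drawn feature.
        "random_node": DecisionTreeClassifier(max_depth=1, max_features=1,
                                              random_state=0),
    }
    return {name: cross_val_score(model, X, y, cv=cv).mean()
            for name, model in learners.items()}


X, y = load_iris(return_X_y=True)
scores = landmarking_metafeatures(X, y)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same four simple learners (without the random node) are the ones listed for `compute_imputation_score`.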

dataclean.deal_mar(df)
Deal with missing data that follows the missing at random (MAR) pattern.

dataclean.deal_mcar(df)
Deal with missing data that follows the missing completely at random (MCAR) pattern.

dataclean.deal_mnar(df)
Deal with missing data that follows the missing not at random (MNAR) pattern.

dataclean.discover_type_heuristic(data)
Infer data types for each feature using simple logic.
Parameters: data : numpy array or dataframe
Numeric data needs to be 64 bit.
Returns: result : list
List of data types.

dataclean.discover_types(Xy)
Discover types for numpy array.
Both simple logic rules and Bayesian methods are applied. Bayesian methods can only be applied if Xy is numeric.
Parameters: Xy : numpy array or DataFrame
Xy must be numeric for the Bayesian model to run.

dataclean.drop_duplicated_rows(dataframe)
Drop duplicated rows.

dataclean.drop_outliers(df, df_outliers)
Drops the detected outliers.

dataclean.handle_missing(features, Xy)
Handle missing values.
Recommends the appropriate approach to the user given the missing mechanism of the dataset. The user can adopt the recommended approach or choose another available approach.
For MCAR, the following methods are evaluated: ‘list deletion’, ‘mean’, ‘mode’, ‘k nearest neighbors’, ‘matrix factorization’, ‘multiple imputation’.
For MAR, the following methods are evaluated: ‘k nearest neighbors’, ‘matrix factorization’, ‘multiple imputation’.
For MNAR, ‘multiple imputation’ is adopted.
Parameters: features : list
List of feature names.
Xy : arraylike
Complete numpy array (target required).
Returns: features_new : List
List of feature names after cleaning.
Xy_filled : arraylike
Numpy array where missing values have been cleaned.
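Four of the methods named above (list deletion, mean, mode, k nearest neighbors) can be sketched with NumPy and scikit-learn. This is an illustration under assumptions, not the code dataclean runs; matrix factorization and multiple imputation are left out because they need additional libraries or experimental scikit-learn features.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy feature matrix with two missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 8.0]])

# List deletion: drop every row that contains a missing value.
listwise = X[~np.isnan(X).any(axis=1)]

# Mean and mode imputation: fill each column with its mean / most frequent value.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

# k nearest neighbors: fill from the 2 most similar rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(listwise)
print(mean_filled)
print(knn_filled)
```

List deletion shrinks the dataset, while the three imputers keep its shape.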

dataclean.handle_outlier(features, Xy)
Cleans the outliers.
Recommends an outlier detection algorithm to the user and presents the detected outliers in visualizations. The user can decide whether or not to keep the outliers.
Parameters: features : list
List of feature names.
Xy : arraylike
Numpy array. Both training vectors and target are required.
Returns: Xy_no_outliers : arraylike
Cleaned data where outliers are dropped.
Xy : arraylike
Original data, returned when no outliers are found or the user keeps them.

dataclean.highlight_outlier(data)
Highlight the maximum in a Series yellow.

dataclean.identify_missing(df=None)
Detect missing values.
Identifies common missing-value markers such as ‘n/a’, ‘na’, ‘–’ and ‘?’ as missing. The user can also customize the characters to be identified as missing.
Parameters: df : DataFrame
Raw data formatted in DataFrame.
Returns: flag : bool
Indicates whether missing values are detected. If true, missing values are detected. Otherwise not.
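The marker-based detection described above can be approximated with pandas. In this sketch the helper `flag_missing`, its `extra_tokens` parameter, the case-insensitive matching, and the ASCII `--` token are assumptions added for illustration; the other tokens come from the description.

```python
import numpy as np
import pandas as pd

# Markers from the description, plus an ASCII "--" variant (assumption).
MISSING_TOKENS = ("n/a", "na", "–", "--", "?")


def flag_missing(df, extra_tokens=()):
    """Hypothetical helper: replace marker strings with NaN and report a flag."""
    tokens = {t.lower() for t in (*MISSING_TOKENS, *extra_tokens)}

    def to_nan(value):
        # Only strings can be markers; compare case-insensitively.
        if isinstance(value, str) and value.strip().lower() in tokens:
            return np.nan
        return value

    cleaned = df.apply(lambda col: col.map(to_nan))
    return cleaned, bool(cleaned.isna().any().any())


df = pd.DataFrame({"age": ["31", "?", "45"],
                   "city": ["Oslo", "N/A", "Rome"]})
cleaned, flag = flag_missing(df)
print(cleaned)
print("missing values detected:", flag)
```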

dataclean.identify_missing_mechanism(df=None)
Tries to guess the missing mechanism of the dataset.
The missing mechanism is not really testable. The missing correlation between features may give reasons to suspect that the dataset follows one mechanism, but the result is not definite. Relevant information is provided to help the user make the decision. Three missing mechanisms are considered:
MCAR: missing completely at random
MAR: missing at random
MNAR: missing not at random (not available here; normally involves a field expert)
Parameters: df : DataFrame
Raw data formatted in DataFrame.
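One way to inspect the missing correlation between features is to correlate the per-column missingness indicators. The sketch below is an assumption about what such a check could look like, with a hypothetical helper name; as noted above, a strong correlation is only a hint, not a test of the mechanism.

```python
import numpy as np
import pandas as pd


def missingness_correlation(df):
    """Hypothetical helper: correlation between per-column missing indicators."""
    indicators = df.isna().astype(int)
    # Columns that are fully observed (or fully missing) have no variance,
    # so their correlation is undefined; leave them out.
    varying = indicators.loc[:, indicators.nunique() > 1]
    return varying.corr()


df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [9.0, np.nan, 7.0, np.nan, 5.0, 4.0],   # missing in the same rows as "a"
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0],      # missing in a different row
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],         # fully observed
})
corr = missingness_correlation(df)
print(corr.round(2))
```

Here "a" and "b" always go missing together, which suggests their missingness is related rather than completely at random.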

dataclean.identify_outliers(df, algorithm=0, detailed=False)
Identifies outliers in multiple dimensions.
The dataset has to be parsed as numeric beforehand.

dataclean.infer_feature_type(feature)
Infer the data type of the given feature using simple logic.
Possible data types to infer: boolean, date, float, integer, string. A feature that is not a boolean, a date, a float or an integer is classified as a string.
Parameters: feature : arraylike
A feature/attribute vector.
Returns: data_type : string
The data type of the given feature/attribute.
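The fallback rule (anything that is not a boolean, date, float or integer is a string) can be sketched with the standard library alone. The helper names, the order of the checks, and the accepted date formats below are assumptions; the actual heuristics in dataclean may differ.

```python
from datetime import datetime

# Date formats accepted by this sketch (assumption).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def _parses(value, parser):
    """Return True if parser(value) succeeds without a ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    return any(_parses(value, lambda v, f=fmt: datetime.strptime(v, f))
               for fmt in DATE_FORMATS)


def infer_type(values):
    """Hypothetical helper: infer one type label for a feature vector."""
    # Ignore empty entries so they do not force the "string" fallback.
    cleaned = [str(v).strip() for v in values
               if v is not None and str(v).strip()]
    if not cleaned:
        return "string"
    if all(v.lower() in ("true", "false") for v in cleaned):
        return "boolean"
    if all(_parses(v, int) for v in cleaned):
        return "integer"
    if all(_parses(v, float) for v in cleaned):
        return "float"
    if all(_is_date(v) for v in cleaned):
        return "date"
    return "string"


print(infer_type(["1", "2", "3"]))
print(infer_type(["2020-01-05", "2021-12-31"]))
print(infer_type(["a", "1"]))
```

Integers are tested before floats because every integer string also parses as a float.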

dataclean.missing_preprocess(features, df=None)
Drops the redundant information.
Redundant information is dropped before imputation. Detects and drops empty rows. Detects features and instances with an extremely large proportion of missing data and reports them to the user.
Parameters: features : list
List of feature names.
df : DataFrame
Returns: df : DataFrame
New DataFrame where redundant information may have been deleted.
features_new : list
List of feature names after preprocessing.

dataclean.plot_feature_importances(dataset_name, features, importances, indices)
Plot the 15 most important features.

dataclean.predict_best_anomaly_algorithm(X, y)
Predicts the best anomaly detection algorithm.
Recommends the best anomaly detection algorithm to the user given the characteristics of the dataset. The following algorithms are considered: 0: isolation forest; 1: local outlier factor; 2: one class support vector machine.
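The three candidate algorithms all exist in scikit-learn. The sketch below maps the indices 0, 1 and 2 to the corresponding estimators and returns a boolean outlier mask. The helper `detect_outliers`, the `contamination` parameter, and the use of `nu` for the one-class SVM are assumptions, not the dataclean API.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def detect_outliers(X, algorithm=0, contamination=0.05):
    """Hypothetical helper: return True for rows flagged as outliers."""
    if algorithm == 0:
        detector = IsolationForest(contamination=contamination, random_state=0)
    elif algorithm == 1:
        detector = LocalOutlierFactor(contamination=contamination)
    elif algorithm == 2:
        # nu is an upper bound on the fraction of training errors.
        detector = OneClassSVM(nu=contamination, gamma="scale")
    else:
        raise ValueError("algorithm must be 0, 1 or 2")
    # All three estimators label outliers as -1 and inliers as +1.
    return detector.fit_predict(X) == -1


# 100 points around the origin plus two obvious outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               [[8.0, 8.0], [-9.0, 7.0]]])
for algorithm in (0, 1, 2):
    mask = detect_outliers(X, algorithm=algorithm)
    print(algorithm, "flagged rows:", np.flatnonzero(mask))
```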

dataclean.show_important_features(X, y, data_name, features)
Show the most important features of the given dataset.
Computes the most important features of the given dataset using a random forest and presents the 15 most useful features to the user in a bar chart.
Parameters: X : arraylike, shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : arraylike, shape (n_samples,)
Target values (class labels in classification, real numbers in regression).
data_name : string
Dataset name.
features : list
List of feature names.

dataclean.show_statistical_info(Xy)
Show statistical information about the given dataset.
Parameters: Xy : arraylike

dataclean.train_metalearner()
Train the metalearner.

dataclean.unify_name_consistency(names)
Unify inconsistent column names.
Parameters: names : list
List of original column names.
Returns: names : list
Unified column names.
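The unification rule is not spelled out here. As an illustration only, the sketch below assumes that "unified" means lowercase snake_case; the helper `unify_names` and this convention are hypothetical.

```python
import re


def unify_names(names):
    """Hypothetical helper: normalize column names to lowercase snake_case."""
    unified = []
    for name in names:
        name = name.strip()
        # Insert an underscore at camelCase boundaries (e.g. lastName).
        name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
        # Collapse every run of non-alphanumeric characters into one underscore.
        name = re.sub(r"[^0-9A-Za-z]+", "_", name)
        unified.append(name.strip("_").lower())
    return unified


print(unify_names(["First Name", "lastName", " AGE ", "zip-code"]))
```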

dataclean.visualize_missing(df=None)
Visualize missing values.
The missingness of the dataset is visualized as a bar chart, a matrix and a heatmap.

dataclean.visualize_outliers_parallel_coordinates(df_scaled, df_pred)
Visualizes high-dimensional outliers with a parallel coordinates plot.

dataclean.visualize_outliers_scatter(df, df_pred)
Visualizes high-dimensional outliers with a scatter plot.
Selects the two features most likely to have outliers and shows them in a scatter plot.