API

A data cleaning Python tool.

dataclean.autoclean(Xy, dataset_name, features)

Auto-cleans data.

The following steps are performed automatically: show important features; show statistical information; discover the data type of each feature; identify duplicated rows; unify inconsistent column names; handle missing values; handle outliers.

Parameters:

Xy : array-like

Complete data.

dataset_name : string

Dataset name.

features : list

List of feature names.

Returns:

Xy_cleaned : array-like

Cleaned data.
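
A minimal usage sketch, not taken from the library's own docs; the toy data and dataset name are made up, and the target is assumed to sit in the last column of Xy:

    import numpy as np
    import dataclean

    # Toy dataset: two features plus a binary target in the last column.
    Xy = np.array([[5.1, 3.5, 0],
                   [4.9, 3.0, 0],
                   [6.2, 2.8, 1],
                   [5.9, 3.1, 1]])
    features = ["sepal_length", "sepal_width"]

    # Run the full cleaning pipeline and keep the cleaned array.
    Xy_cleaned = dataclean.autoclean(Xy, "toy_iris", features)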

dataclean.build_forest(X, y)

Build a random forest model from the dataset and compute feature importances.

Parameters:

X : array-like, shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)

Target values (class labels in classification, real numbers in regression).

Returns:

importances : array, shape = [n_features]

The feature importances (the higher, the more important the feature).

indices : array, shape = [n_features]

Indices of the features sorted in descending order of importance.
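
A minimal usage sketch with made-up toy data; it assumes the returned indices are used only to rank the features:

    import numpy as np
    import dataclean

    X = np.array([[0.1, 1.2], [0.3, 0.9], [1.5, 0.2], [1.7, 0.1]])
    y = np.array([0, 0, 1, 1])

    importances, indices = dataclean.build_forest(X, y)
    # importances[i] is the importance of feature i;
    # indices ranks the features from most to least important.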

dataclean.clean_duplicated_rows(Xy)

Clean duplicated rows.

Parameters:

Xy : array-like

Complete numpy array (target required) of the dataset.

Returns:

Xy : array-like

Original data.

Xy_no_duplicate : array-like

Cleaned data without duplicated rows if user wants to drop the duplicated rows.
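
A minimal usage sketch with made-up data; whether the duplicated row is dropped depends on the user's choice at the prompt:

    import numpy as np
    import dataclean

    # The third row duplicates the first one.
    Xy = np.array([[1.0, 2.0, 0],
                   [3.0, 4.0, 1],
                   [1.0, 2.0, 0]])

    Xy_result = dataclean.clean_duplicated_rows(Xy)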

dataclean.clean_missing(df, features)

Clean missing values in the dataset.

Parameters:

df : DataFrame

features : List

List of feature names.

Returns:

features_new : List

List of feature names after cleaning.

Xy_filled : array-like

Numpy array where missing values have been cleaned.
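
A minimal usage sketch, assuming a pandas DataFrame that contains the features and the target (the toy data is made up):

    import numpy as np
    import pandas as pd
    import dataclean

    df = pd.DataFrame({"age":    [23.0, 35.0, np.nan, 41.0],
                       "income": [50000.0, np.nan, 62000.0, 58000.0],
                       "label":  [0, 1, 0, 1]})
    features = ["age", "income"]

    features_new, Xy_filled = dataclean.clean_missing(df, features)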

dataclean.compute_clustering_metafeatures(X)

Computes clustering meta features.

The following 3 clustering meta-features are adopted: Silhouette Coefficient; Calinski-Harabasz Index; Davies-Bouldin Index.

dataclean.compute_imputation_score(Xy)

Computes score of the imputation by applying simple classifiers.

The following simple learners are evaluated: Naive Bayes Learner; Linear Discriminant Learner; One Nearest Neighbor Learner; Decision Node Learner.

Parameters:

Xy : array-like

Complete numpy array of the dataset. The training array X has to be imputed already, and the target y is required (not optional) in order to predict the performance of the imputation method.

Returns:

imputation_score : float

Predicted score of the imputation method.
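
A minimal usage sketch; the toy array is made up and assumes the target sits in the last column:

    import numpy as np
    import dataclean

    # Already-imputed training data plus the (required) target column.
    Xy = np.array([[0.5, 1.1, 0],
                   [0.7, 0.9, 0],
                   [1.8, 0.2, 1],
                   [2.0, 0.4, 1]])

    imputation_score = dataclean.compute_imputation_score(Xy)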

dataclean.compute_metafeatures(X, y)

Computes landmarking meta features.

The following landmarking features are computed: Naive Bayes Learner; Linear Discriminant Learner; One Nearest Neighbor Learner; Decision Node Learner; Randomly Chosen Node Learner.

dataclean.deal_mar(df)

Deal with missing data with a missing at random (MAR) pattern.

dataclean.deal_mcar(df)

Deal with missing data with a missing completely at random (MCAR) pattern.

dataclean.deal_mnar(df)

Deal with missing data with a missing not at random (MNAR) pattern.

dataclean.discover_type_heuristic(data)

Infer data types for each feature using simple logic.

Parameters:

data : numpy array or dataframe

Numeric data needs to be 64-bit.

Returns:

result : list

List of data types.
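
A minimal usage sketch with a made-up DataFrame:

    import pandas as pd
    import dataclean

    data = pd.DataFrame({"id":     [1, 2, 3],
                         "price":  [9.99, 12.50, 3.75],
                         "member": [True, False, True]})

    result = dataclean.discover_type_heuristic(data)
    # result is a list with one inferred data type per feature.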

dataclean.discover_types(Xy)

Discover types for a numpy array.

Both simple logic rules and Bayesian methods are applied. Bayesian methods can only be applied if Xy is numeric.

Parameters:

Xy : numpy array or DataFrame

Xy has to be numeric in order to run the Bayesian model.

dataclean.drop_duplicated_rows(dataframe)

Drop duplicated rows.

dataclean.drop_outliers(df, df_outliers)

Drops the detected outliers.

dataclean.handle_missing(features, Xy)

Handle missing values.

Recommends the appropriate approach to the user given the missing mechanism of the dataset. The user can choose to adopt the recommended approach or take another available approach.

For MCAR, the following methods are evaluated: ‘list deletion’, ‘mean’, ‘mode’, ‘k nearest neighbors’, ‘matrix factorization’, ‘multiple imputation’.

For MAR, the following methods are evaluated: ‘k nearest neighbors’, ‘matrix factorization’, ‘multiple imputation’.

For MNAR, ‘multiple imputation’ is adopted.

Parameters:

features : list

List of feature names.

Xy : array-like

Complete numpy array (target required and not optional).

Returns:

features_new : List

List of feature names after cleaning.

Xy_filled : array-like

Numpy array where missing values have been cleaned.
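
A minimal usage sketch; the toy array is made up, np.nan marks the missing entries, and the target is assumed to be the last column:

    import numpy as np
    import dataclean

    features = ["age", "income"]
    Xy = np.array([[23.0,   50000.0, 0],
                   [35.0,   np.nan,  1],
                   [np.nan, 62000.0, 0],
                   [41.0,   58000.0, 1]])

    features_new, Xy_filled = dataclean.handle_missing(features, Xy)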

dataclean.handle_outlier(features, Xy)

Cleans the outliers.

Recommends an algorithm to detect the outliers and presents them to the user in effective visualizations. The user can decide whether or not to keep the outliers.

Parameters:

features : list

List of feature names.

Xy : array-like

Numpy array. Both training vectors and target are required.

Returns:

Xy_no_outliers : array-like

Cleaned data where outliers are dropped.

Xy : array-like

Original data if no outliers are found or the user chooses to keep them.
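
A minimal usage sketch with made-up data; the returned array depends on whether the user keeps or drops the detected outliers:

    import numpy as np
    import dataclean

    features = ["height", "weight"]
    Xy = np.array([[170.0,  65.0, 0],
                   [175.0,  72.0, 1],
                   [168.0,  60.0, 0],
                   [300.0, 500.0, 1]])   # the last row is an obvious outlier

    Xy_result = dataclean.handle_outlier(features, Xy)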

dataclean.highlight_outlier(data)

Highlight the maximum value in a Series in yellow.

dataclean.identify_missing(df=None)

Detect missing values.

Identifies common missing characters such as ‘n/a’, ‘na’, ‘–’ and ‘?’ as missing. The user can also customize the characters to be identified as missing.

Parameters:

df : DataFrame

Raw data formatted in DataFrame.

Returns:

flag : bool

Indicates whether missing values are detected: True if missing values are found, False otherwise.
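
A minimal usage sketch with a made-up DataFrame containing common missing characters:

    import pandas as pd
    import dataclean

    df = pd.DataFrame({"city": ["Berlin", "n/a", "Paris"],
                       "temp": ["21", "?", "19"]})

    flag = dataclean.identify_missing(df=df)
    # flag is True if missing values were detected.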

dataclean.identify_missing_mechanism(df=None)

Tries to guess the missing mechanism of the dataset.

The missing mechanism is not really testable. There may be reasons to suspect that the dataset belongs to one missing mechanism based on the missing correlations between features, but the result is not definite. Relevant information is provided to help the user make the decision. Three missing mechanisms can be guessed: MCAR (missing completely at random); MAR (missing at random); MNAR (missing not at random, not available here as it normally involves a field expert).

Parameters:

df : DataFrame

Raw data formatted in DataFrame.

dataclean.identify_outliers(df, algorithm=0, detailed=False)

Identifies outliers in multiple dimensions.

Dataset has to be parsed as numeric beforehand.

dataclean.infer_feature_type(feature)

Infer data types for the given feature using simple logic.

Possible data types to infer: boolean, date, float, integer, string. A feature that is not a boolean, a date, a float, or an integer is classified as a string.

Parameters:

feature : array-like

A feature/attribute vector.

Returns:

data_type : string

The data type of the given feature/attribute.
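
A minimal usage sketch with a made-up feature vector:

    import numpy as np
    import dataclean

    feature = np.array(["2021-01-01", "2021-02-15", "2021-03-30"])
    data_type = dataclean.infer_feature_type(feature)
    # data_type is one of: boolean, date, float, integer, string.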

dataclean.missing_preprocess(features, df=None)

Drops the redundant information.

Redundant information is dropped before imputation. Detects and drops empty rows. Detects features and instances with an extremely large proportion of missing data and reports them to the user.

Parameters:

features : list

List of feature names.

df : DataFrame

Returns:

df : DataFrame

New DataFrame where redundant information may have been deleted.

features_new : list

List of feature names after preprocessing.

dataclean.plot_feature_importances(dataset_name, features, importances, indices)

Plot the 15 most important features.

dataclean.predict_best_anomaly_algorithm(X, y)

Predicts best anomaly detection algorithm.

Recommends the best anomaly detection algorithm to the user given the characteristics of the dataset. The following algorithms are considered: 0: isolation forest; 1: local outlier factor; 2: one class support vector machine.

dataclean.show_important_features(X, y, data_name, features)

Show the most important features of the given dataset.

Computes the most important features of the given dataset using a random forest, and presents the 15 most useful features to the user with a bar chart.

Parameters:

X : array-like, shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)

Target values (class labels in classification, real numbers in regression).

data_name : string

Dataset name.

features : list

List of feature names.
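
A minimal usage sketch with made-up data and dataset name:

    import numpy as np
    import dataclean

    X = np.array([[0.1, 1.2, 5.0],
                  [0.3, 0.9, 4.1],
                  [1.5, 0.2, 0.3],
                  [1.7, 0.1, 0.5]])
    y = np.array([0, 0, 1, 1])
    features = ["f1", "f2", "f3"]

    # Fits a random forest and shows a bar chart of the most important features.
    dataclean.show_important_features(X, y, "toy_data", features)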

dataclean.show_statistical_info(Xy)

Show statistical information of the given dataset.

Parameters:

Xy : array-like

Complete data.

dataclean.train_metalearner()

Train the metalearner.

dataclean.unify_name_consistency(names)

Unify inconsistent column names.

Parameters:

names : list

List of original column names.

Returns:

names : list

Unified column names.
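
A minimal usage sketch with made-up column names:

    import dataclean

    names = ["First Name", "last_name", "EMAIL Address"]
    names_unified = dataclean.unify_name_consistency(names)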

dataclean.visualize_missing(df=None)

Visualize missing values.

The missingness of the dataset is visualized with a bar chart, a matrix, and a heatmap.

dataclean.visualize_outliers_parallel_coordinates(df_scaled, df_pred)

Visualizes high-dimensional outliers with a parallel coordinates plot.

dataclean.visualize_outliers_scatter(df, df_pred)

Visualizes high-dimensional outliers with a scatter plot.

Selects the two features most likely to have outliers and shows them in a scatter plot.