The first step is to acquire data from OpenML.
import openml as oml import datacleanbot.dataclean as dc import numpy as np data = oml.datasets.get_dataset(id) # id: openml dataset id X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array') Xy = np.concatenate((X,y.reshape((y.shape,1))), axis=1)
Show Impotant Features¶
datacleanbot computes the most important features of
the given dataset using random forest and present the
15 most useful features to the user.
dc.show_important_features(X, y, data.name, features)
Unify Column Names¶
Inconsistent capitalization of column names can be detected and reported to the user. Users can decide whether to unify them or not. The capitalization can be unified to either upper case or lower case.
Show Statistical Inforamtion¶
datacleanbot can present the statistical information to
help users gain a better understanding of the data
Discover Data Types¶
datacleanbot can discover feature data types.
Basic data types discovered are ‘datetime’, ‘float’, ‘integer’,
‘bool’ and ‘string’.
datacleanbot also can discover statistical data types (real, positive real,
categorical and count) using Bayesian Model abda.
Clean Duplicated Rows¶
datacleanbot detects the duplicated records and reports them to users.
Handle Missing Values¶
datacleanbot identifies characters ‘n/a’, ‘na’, ‘–’ and ‘?’ as missing values.
Users can add extra characters to be considered as missing. After the missing
values being detected,
datacleanbot will present the missing values in effective
visualizations to help users identify the missing mechanism. Afterwards,
recommends the appropriate approach to clean missing values according the missing
features, Xy = dc.handle_missing(features, Xy)
A meta-learner is trained beforehand to recommend the outlier detection algorithm according to the meta features og the given dataset. Users can apply the recommended algorithm or any other available algorithm to detect outliers. After the detection, outliers will be present to users in effective visualizations and users can choose to drop them or not.
Xy = dc.handle_outlier(features, Xy)