datacleanbot

Welcome to the documentation of the datacleanbot Python API. datacleanbot offers automated, data-driven support to help users clean data effectively and smoothly. Given a raw dataset representing a machine learning problem, the tool automatically identifies potential issues and reports results and recommendations to the end user in an effective way. datacleanbot is designed with a strong connection to OpenML, a platform where people can easily share data, experiments and machine learning models. Users can easily acquire a dataset from OpenML with its dataset ID and clean it with datacleanbot.

User’s Guide

Usage

Acquire Data

The first step is to acquire data from OpenML.

import openml as oml
import datacleanbot.dataclean as dc
import numpy as np

data = oml.datasets.get_dataset(dataset_id) # dataset_id: the OpenML dataset ID
X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
Xy = np.concatenate((X,y.reshape((y.shape[0],1))), axis=1)

Show Important Features

datacleanbot computes the most important features of the given dataset using a random forest and presents the 15 most useful features to the user.

dc.show_important_features(X, y, data.name, features)
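
For intuition, the ranking behind this call can be approximated directly with scikit-learn. The following is a minimal sketch under assumed estimator settings (and assuming a classification target), not datacleanbot's exact implementation:

# Sketch: rank features with a random forest (settings are assumptions)
from sklearn.ensemble import RandomForestClassifier
import numpy as np

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # most important feature first
for i in indices[:15]:                   # datacleanbot reports the top 15
    print(features[i], importances[i])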

Unify Column Names

Inconsistent capitalization of column names can be detected and reported to the user. Users can decide whether to unify them or not. The capitalization can be unified to either upper case or lower case.

dc.unify_name_consistency(features)

Show Statistical Information

datacleanbot presents statistical information to help users gain a better understanding of the data distribution.

dc.show_statistical_info(Xy)
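
The summary produced here resembles the output of pandas' describe(); if you only need the raw statistics, a similar table can be obtained yourself (a sketch, not the tool's code):

import pandas as pd

# Per-column count, mean, std, min, quartiles and max of the complete data
print(pd.DataFrame(Xy).describe())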

Discover Data Types

datacleanbot can discover feature data types. The basic data types discovered are ‘datetime’, ‘float’, ‘integer’, ‘bool’ and ‘string’. datacleanbot can also discover statistical data types (real, positive real, categorical and count) using the Bayesian model abda.

dc.discover_types(Xy)
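
As a rough illustration of the heuristic part (the Bayesian abda step is a separate model), basic types can be guessed per column. This is a toy sketch that omits the datetime case and is not datacleanbot's actual rule set:

import pandas as pd

def basic_type(column):
    # Toy heuristic: bool, integer, float or string (datetime omitted)
    s = pd.Series(column).dropna()
    if set(s.unique()) <= {0, 1, True, False}:
        return 'bool'
    if pd.api.types.is_integer_dtype(s):
        return 'integer'
    if pd.api.types.is_float_dtype(s):
        return 'float'
    return 'string'

print([basic_type(Xy[:, j]) for j in range(Xy.shape[1])])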

Clean Duplicated Rows

datacleanbot detects duplicated records and reports them to the user.

dc.clean_duplicated_rows(Xy)
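
Under the hood this amounts to row-level duplicate detection; the same records can be inspected with pandas (a sketch, not the tool's code):

import pandas as pd

df = pd.DataFrame(Xy)
print(df[df.duplicated(keep=False)])  # every copy of each duplicated row
df_unique = df.drop_duplicates()      # what dropping them would leave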

Handle Missing Values

datacleanbot identifies the characters ‘n/a’, ‘na’, ‘--’ and ‘?’ as missing values. Users can add extra characters to be considered missing. After the missing values are detected, datacleanbot presents them in effective visualizations to help users identify the missing mechanism. Afterwards, datacleanbot recommends an appropriate approach to clean the missing values according to the missing mechanism.

features, Xy = dc.handle_missing(features, Xy)
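
Conceptually, the detection step replaces the missing characters with NaN and counts them per feature. A minimal pandas sketch (the character list matches the default shown in the examples; everything else is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(Xy)
df = df.replace(['n/a', 'na', '--', '?'], np.nan)  # default missing characters
print(df.isnull().sum())                           # missing count per feature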

Outlier Detection

A meta-learner is trained beforehand to recommend an outlier detection algorithm according to the meta features of the given dataset. Users can apply the recommended algorithm or any other available algorithm to detect outliers. After the detection, outliers are presented to users in effective visualizations, and users can choose whether to drop them.

Xy = dc.handle_outlier(features, Xy)
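
To reproduce the isolation forest step outside the tool, here is a minimal scikit-learn sketch; the parameters are assumptions, and it expects missing values to have been handled first:

from sklearn.ensemble import IsolationForest
import numpy as np

iso = IsolationForest(random_state=0)
labels = iso.fit_predict(Xy)        # -1 marks outliers, 1 marks inliers
scores = iso.decision_function(Xy)  # more negative = more anomalous
print(np.where(labels == -1)[0])
print(scores[labels == -1])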

Example

Example_autoclean

[4]:
import datacleanbot.dataclean as dc
import openml as oml
import numpy as np
[5]:
# acquire data
data = oml.datasets.get_dataset(51)
X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
Xy = np.concatenate((X,y.reshape((y.shape[0],1))), axis=1)
[6]:
# run autoclean on the acquired dataset
Xy = dc.autoclean(Xy, data.name, features)

Important Features

[Image: feature importance bar chart — _images/Example_autoclean_3_1.png]

Statistical Information

0 1 2 3 4 5 6 7 8 9 10 11 12 13
count 294.000000 3.0 294.000000 271.000000 293.000000 286.000000 294.000000 293.000000 294.000000 104.000000 28.000000 293.000000 293.000000 294.000000
mean 47.826531 0.0 1.867347 250.848708 0.303754 0.930070 0.586054 1.156997 0.724490 1.105769 1.035714 139.129693 132.583618 0.360544
std 7.811812 0.0 0.956077 67.657711 0.460665 0.255476 0.908648 0.417011 0.447533 0.338995 0.881167 23.589749 17.626568 0.480977
min 28.000000 0.0 0.000000 85.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 82.000000 92.000000 0.000000
25% 42.000000 0.0 1.000000 209.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 122.000000 120.000000 0.000000
50% 49.000000 0.0 2.000000 243.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 140.000000 130.000000 0.000000
75% 54.000000 0.0 3.000000 282.500000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 155.000000 140.000000 1.000000
max 66.000000 0.0 3.000000 603.000000 1.000000 1.000000 5.000000 2.000000 1.000000 2.000000 2.000000 190.000000 200.000000 1.000000

Discover Data Types

Simple Data Types

['int64', 'int64', 'int64', 'int64', 'int64', 'int64', 'int64', 'int64', 'bool', 'float64', 'float64', 'int64', 'int64', 'bool']

Statistical Data Types

['Type.POSITIVE', 'Type.CATEGORICAL', 'Type.CATEGORICAL', 'Type.POSITIVE', 'Type.COUNT', 'Type.CATEGORICAL', 'Type.POSITIVE', 'Type.COUNT', 'Type.COUNT', 'Type.CATEGORICAL', 'Type.CATEGORICAL', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.CATEGORICAL']

Duplicated Rows

Identifying Duplicated Rows ...

Duplicated rows are detected.

       0   1    2   3    4    5    6    7    8   9   10     11     12   13
101  49.0 NaN  3.0 NaN  0.0  1.0  0.0  1.0  0.0 NaN NaN  160.0  110.0  0.0
102  49.0 NaN  3.0 NaN  0.0  1.0  0.0  1.0  0.0 NaN NaN  160.0  110.0  0.0

Do you want to drop the duplicated rows? [y/n]y

Duplicated rows are dropped.

Inconsistent Column Names


Column names
============
['age', 'sex', 'chest_pain', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

Column names are consistent

Missing values

Identify Missing Data ...

The default setting of missing characters is ['n/a', 'na', '--', '?']
Do you want to add extra character? [y/n]n

Missing values detected!

Number of missing in each feature
0       0
1     290
2       0
3      22
4       1
5       8
6       0
7       1
8       0
9     189
10    265
11      1
12      1
13      0
dtype: int64

Records containing missing values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 28.0 NaN 3.0 132.0 0.0 1.0 0.0 0.0 1.0 NaN NaN 185.0 130.0 0.0
1 29.0 NaN 3.0 243.0 0.0 1.0 0.0 1.0 1.0 NaN NaN 160.0 120.0 0.0
2 29.0 NaN 3.0 NaN 0.0 1.0 0.0 1.0 1.0 NaN NaN 170.0 140.0 0.0
3 30.0 NaN 0.0 237.0 0.0 1.0 0.0 2.0 0.0 NaN 0.0 170.0 170.0 0.0
4 31.0 NaN 3.0 219.0 0.0 1.0 0.0 2.0 0.0 NaN NaN 150.0 100.0 0.0

Missing correlation between features containing missing values and other features
1 3 4 5 7 9 10 11 12
0 -0.054393 0.019737 0.001330 -0.014961 0.053771 -0.234171 0.001532 0.001330 0.001330
1 1.000000 -0.099671 0.005952 0.017041 0.005952 -0.004595 0.082259 0.005952 0.005952
2 0.020988 0.067940 0.069733 0.023981 -0.114337 0.342524 0.075190 0.069733 0.069733
3 -0.099671 1.000000 -0.016674 -0.047736 -0.016674 0.076025 0.048563 -0.016674 -0.016674
4 0.005952 -0.016674 1.000000 -0.009805 -0.003425 -0.078890 0.019022 1.000000 1.000000
5 0.017041 -0.047736 -0.009805 1.000000 -0.009805 -0.007021 -0.088011 -0.009805 -0.009805
6 0.009863 -0.077552 0.091000 -0.039312 -0.037900 -0.841642 0.005952 0.091000 0.091000
7 0.005952 -0.016674 -0.003425 -0.009805 1.000000 0.043410 0.019022 -0.003425 -0.003425
8 -0.062333 0.000198 -0.095489 -0.085351 0.035864 -0.038358 -0.016808 -0.095489 -0.095489
9 -0.004595 0.076025 -0.078890 -0.007021 0.043410 1.000000 0.025752 -0.078890 -0.078890
10 0.082259 0.048563 0.019022 -0.088011 0.019022 0.025752 1.000000 0.019022 0.019022
11 0.005952 -0.016674 1.000000 -0.009805 -0.003425 -0.078890 0.019022 1.000000 1.000000
12 0.005952 -0.016674 1.000000 -0.009805 -0.003425 -0.078890 0.019022 1.000000 1.000000
Missing mechanism is probably missing at random

Visualize Missing Data ...


[Images: missing data visualizations (bar chart, matrix, heatmap) — _images/Example_autoclean_3_27.png, _images/Example_autoclean_3_28.png, _images/Example_autoclean_3_29.png]

Clean Missing Data ...

Feature [1, 10] has an extremely large proportion of missing data
Do you want to delete the above features? [y/n]y

Choose the missing mechanism [a/b/c/d]:
a.MCAR b.MAR c.MNAR d.Skip
b
Imputation score of knn is 0.7567397233586597
Imputation score of matrix factorization is 0.7567397233586597
Imputation score of multiple imputation is 0.8122681667640756
Imputation method with the highest score is multiple imputation

Recommended Approach!
The recommended approach is multiple imputation
Do you want to apply the recommended approach? [y/n]y

Applying multiple imputation ...
Missing values cleaned!

Outliers

Recommend Algorithm ...

The recommended approach is isolation forest.
Do you want to apply the recommended outlier detection approach? [y/n]y

Visualize Outliers ...

[Image: outlier visualization — _images/Example_autoclean_3_38.png]
0 1 2 3 4 5 6 7 8 9 10 11 anomaly_score
232 48 1 275 1 0 2 2 1 0 150 122 1 -0.119937
254 46 0 272 0 0 2 1 1 1 175 140 1 -0.0797964
275 59 1 264 1 0 0 0 1 0.944132 119 140 1 -0.0689509
90 48 3 308 0.141883 1 2 2 0 2 147.257 139.446 0 -0.068584
220 59 1 338 1 0 1.5 2 0 1 130 130 1 -0.0657597
291 58 3 393 1 1 1 1 0 1 110 180 1 -0.0580726
248 58 2 211 0 0 0 2 1 1.15535 92 160 1 -0.0565477
3 30 0 237 0 1 0 2 0 1.32164 170 170 0 -0.0521636
117 51 2 220 1 1 2 1 0 2 160 130 0 -0.0464553
223 65 1 306 1 0 1.5 1 1 1 87 140 1 -0.0440714
94 48 1 163 0 1 2 1 0 2 175 108 0 -0.0405723
268 55 3 292 1 0 2 1 1 1 143 160 1 -0.0397377
171 57 1 347 1 1 0.8 2 0 1 126 180 0 -0.0360816
242 54 1 603 1 0 1 1 1 1 125 130 1 -0.0355075
146 54 0 171 0 1 2 1 1 2 137 120 0 -0.0348489
276 65 1 263 1 0 2 1 1 1 112 170 1 -0.0329883
273 58 3 164 1 1 2 2 1 1 99 136 1 -0.0309242
12 35 0 160 0 1 0 2 0 1.31709 185 120 0 -0.0297764
154 54 1 365 0 1 1 2 1 2 134 150 0 -0.0297346
157 55 3 394 0 1 0 0 0 1.43866 150 130 0 -0.0280174
289 54 2 294 1 1 0 2 0 1 100 130 1 -0.026562
0 28 3 132 0 1 0 0 1 1.42619 185 130 0 -0.0263437
31 39 3 224.323 0 1 2 2 1 2 146 120 0 -0.0256593
263 52 1 246 1 1 4 2 1 1 82 160 1 -0.0254428
185 62 0 193 0 1 0 1 0 1.36576 116 160 0 -0.0246483
170 57 0 308 0 1 1 1 0 1 98 130 0 -0.0235655
227 40 1 392 0 1 2 1 0 1 130 150 1 -0.0210837
290 56 1 342 1 0 3 1 1 1 150 155 1 -0.0182647
130 53 3 468 0 0.864092 0 1 0 1.37019 127 113 0 -0.0181947
91 48 3 256.72 0 0 0 2 0 1.3819 148 120 0 -0.0151125
285 50 1 231 1 1 5 2 1 1 140 140 1 -0.0115976
183 61 1 294 1 1 1 2 0 1 120 130 0 -0.0100509
255 47 2 248 0 0 0 1 0 1.16329 170 135 1 -0.00722342
37 39 2 147 0 0 0 1 1 1.36338 160 160 0 -0.00674078
89 47 1 276 1 0 0 1 1 1.11308 125 140 0 -0.00241484
195 38 1 117 1 1 2.5 1 1 1 134 92 1 -0.00238595
168 56 2 276 1 1 1 1 1 2 128 130 0 -0.00109016
150 54 3 195 0 1 1 2 1 2 130 160 0 -0.000929664
205 48 1 263 0 0 0 1 1 1.00107 110 106 1 0.00114704
250 41 1 172 0 1 2 2 1 1 130 130 1 0.00212158
196 40 1 466 1 0.86451 1 1 1 1 152 120 1 0.00309686
188 33 1 246 1 1 1 1 0 1 150 100 1 0.00348012
118 51 2 200 0 1 0.5 1 0 2 120 150 0 0.0043416
131 53 3 216 1 1 2 1 0 1 142 140 0 0.00484593
224 32 1 529 0 1 0 1 1 0.99009 130 118 1 0.00797082
282 47 1 291 1 1 3 2 1 1 158 160 1 0.010827
246 56 1 213 1 0 1 1 1 1 125 150 1 0.011012
74 45 3 224 0 0 0 1 1 1.35054 122 140 0 0.0124018
281 47 1 205 1 1 2 1 0 1 98 120 1 0.013517
140 54 3 230 0 0 0 1 0 1.38866 140 120 0 0.013915
252 44 3 288 1 1 3 1 1 1 150 150 1 0.0143166
59 43 0 223 0 1 0 1 0 1.24638 142 100 0 0.0148435
228 43 0 291 0 1 0 2 1 1.07937 155 120 1 0.0149449
84 46 1 280 0 1 0 2 1 1.36972 120 180 0 0.0158468
22 37 1 173 0 1 0 2 0 1.38258 184 130 0 0.0158852
14 35 3 308 0 1 0 0 1 1.39788 180 120 0 0.0173538
4 31 3 219 0 1 0 2 0 1.3727 150 100 0 0.0187209
272 56 1 388 1 1 2 2 1 1 122 170 1 0.0187329
184 61 1 292 1 1 0 2 1 1.21704 115 125 0 0.0191592
218 57 3 265 1 1 1 2 1 1 145 140 1 0.0224086
186 62 3 271 0 1 1 1 1 2 152 140 0 0.0224566
265 53 1 285 1 1 1.5 2 1 1 120 180 1 0.0235483
158 55 3 256 0 0 0 1 1 1.37734 137 120 0 0.0238808
35 39 3 241 0 1 0 1 1 1.43285 106 190 0 0.0244046
172 57 3 260 0 0 0 1 1 1.41142 140 140 0 0.0244428
17 36 2 340 0 1 1 1 1 1 184 112 0 0.0256382
260 52 1 342 1 1 1 2 1 1 96 112 1 0.0256705
213 51 1 303 1 1 1 1 0 1 150 160 1 0.0263497
191 36 3 267 0 1 3 1 1 1 160 120 1 0.0272168
244 54 1 198 1 1 2 1 1 1 142 200 1 0.0278133
165 56 2 219 0 0.904268 0 2 0 1.46569 164 130 0 0.0278579
112 50 3 209 0 1 0 2 1 1.48383 116 170 0 0.0279521
109 50 1 328 1 1 1 1 0 1 110 120 0 0.0285039
125 52 1 180 1 1 1.5 1 0 1 140 130 0 0.0286704
143 54 3 309 0 0.889135 0 2 0 1.47009 140 140 0 0.0291233
23 37 3 283 0 1 0 2 1 1.34845 98 130 0 0.0291719
127 52 3 100 1 1 0 1 1 1.356 138 140 0 0.0303691
85 47 3 257 0 1 1 1 0 2 135 140 0 0.030758
72 45 3 244.979 0 1 0 1 0 1.53949 180 180 0 0.0310886
67 44 1 218 0 1 0 2 0 1.30464 115 120 0 0.0312676
141 54 3 273 0 1 1.5 1 0 1 150 120 0 0.0313078
95 48 1 254 0 1 0 2 0 1.30643 110 120 0 0.0314005
155 55 3 344 0 1 0 2 0 1.46327 160 110 0 0.0316206
189 34 0 156 0 1 0 1 1 1.1145 180 140 1 0.0317022
256 48 1 214 1 1 1.5 1 0 1 108 138 1 0.0319747
264 53 2 518 0 1 0 1 1 1.15593 130 145 1 0.0321085
288 52 1 331 1 1 2.5 1 1 0.96057 94 160 1 0.0322196
30 39 2 182 0 1 0 2 0 1.41003 180 110 0 0.0322735
211 50 2 288 1 1 0 1 0 1.06802 140 140 1 0.0325019
292 65 1 275 1 1 1 2 1 1 115 130 1 0.032828
136 53 1 260 1 1 3 2 1 1 112 124 0 0.0339107
247 57 1 255 1 1 3 1 1 1 92 150 1 0.0339626
139 54 3 221 0 1 1 1 0 2 138 120 0 0.0344146
32 39 3 200 1 1 1 1 1 1 160 120 0 0.0366313
177 59 3 188 0 1 1 1 0 1 124 130 0 0.0368223
86 47 2 241.057 0 1 2 1 0 1 145 130 0 0.0369351
208 49 2 180 0 1 1 1 0 1 156 160 1 0.0377091
78 46 1 238 0 1 0 1 0 1.2769 90 130 0 0.0380135
278 41 1 336 1 1 3 1 1 1 118 120 1 0.0391171
180 59 2 213 0 1 0 1 1 1.44776 100 180 0 0.0392088
182 60 2 246 0 1 0 0 1 1.40395 135 120 0 0.0408451
79 46 3 275 1 1 0 1 1 1.32789 165 140 0 0.0412702
233 48 1 193 1 1 3 1 1 1 102 160 1 0.04237
96 48 1 227 1 1 1 1 0 1 130 150 0 0.0457815
286 50 1 341 1 1 2.5 2 1 1 125 140 1 0.0464844
277 66 1 276.836 1 1 1 1 1 1 94 140 1 0.0468952
267 55 0 295 0 1 0 1.11432 1 1.1145 136 140 1 0.0469683
61 43 3 215 0 1 0 2 0 1.47417 175 120 0 0.0490634
9 34 3 161 0 1 0 1 0 1.46804 190 130 0 0.0499133
234 48 1 329 1 1 1.5 1 1 1 92 160 1 0.049914
10 34 3 214 0 1 0 2 1 1.46008 168 150 0 0.050185
82 46 1 238 1 1 1 2 1 1 140 110 0 0.0513488
270 56 3 279 0 1 1 1 0 1 150 120 1 0.0514012
221 60 1 248 0 1 1 1 1 1 125 100 1 0.0528763
280 44 1 491 0 1 0 1 1 1.07103 135 135 1 0.0529694
235 48 1 355 1 1 2 1 1 1 99 160 1 0.0532494
271 56 1 230 1 1 1.5 2 1 1 124 150 1 0.053556
162 55 2 220 0 1 0 0 1 1.38884 134 120 0 0.0541677
62 43 3 249 0 1 0 2 0 1.46809 176 120 0 0.0545522
103 49 2 207 0 1 0 2 0 1.41247 135 130 0 0.0554622
229 45 1 219 1 1 1 2 1 1 130 130 1 0.057132
52 42 2 211 0 1 0 2 0 1.36915 137 115 0 0.05732
45 41 3 250 0 1 0 2 0 1.40704 142 110 0 0.0591437
106 49 1 297 0 0.93087 1 1 1 1 132 120 0 0.0593337
70 44 1 412 0 1 0 1 1 1.34646 170 150 0 0.0608182
187 31 1 270 1 1 1.5 1 1 1 153 120 1 0.0609306
284 49 1 222 0 1 2 1 1 1 122 150 1 0.0609557
142 54 3 253 0 1 0 2 0 1.49633 155 130 0 0.0612764
266 54 1 216 0 1 1.5 1 1 1 105 140 1 0.0616474
145 54 3 312 0 1 0 1 0 1.47594 130 160 0 0.0620174
115 51 3 194 0 1 0 1 0 1.53814 170 160 0 0.0643856
219 58 2 213 0 1 0 2 1 1.24797 140 130 1 0.0646421
259 51 2 160 0 1 2 1 1 1 150 135 1 0.0650465
56 42 2 228 1 1 1.5 1 1 1 152 120 0 0.0666529
217 55 1 201 1 1 3 1 1 1 130 140 1 0.0689589
151 54 3 305 0 1 0 1 1 1.5261 175 160 0 0.0689688
13 35 1 167 0 1 0 1 0 1.33391 150 140 0 0.0695931
97 48 3 240.484 0 1 0 1 1 1.35496 100 100 0 0.0698431
179 59 2 318 1 1 1 1 1 1 120 130 0 0.0702111
240 54 2 237 1 1 1.5 1 1 1.06656 150 120 1 0.070729
241 54 1 242 1 1 1 1 1 1 91 130 1 0.0725249
129 52 2 259 0 1 0 2 1 1.46127 170 140 0 0.0725429
198 41 1 237 1 0.939522 1 1 1 1 138 120 1 0.0725942
283 49 1 212 1 1 0 1 1 0.956925 96 128 1 0.0733429
48 41 3 291 0 1 0 2 1 1.42586 160 120 0 0.073357
206 48 1 260 0 1 2 1 1 1 115 120 1 0.0735847
251 43 1 175 1 1 1 1 1 1 120 120 1 0.0736604
216 54 1 224 0 1 2 1 1 1 122 125 1 0.0743512
27 38 3 275 0 1.00804 0 1 0 1.37389 129 120 0 0.0744339
83 46 1 240 0 1 0 2 1 1.32033 140 110 0 0.0748799
253 44 1 290 1 1 2 1 1 1 100 130 1 0.0750852
65 43 2 240.056 0 1 0 1 0 1.44147 175 150 0 0.0752259
5 32 3 198 0 1 0 1 0 1.39257 165 105 0 0.0754935
64 43 3 186 0 1 0 1 0 1.47751 154 150 0 0.0760975
116 51 2 190 0 1 0 1 0 1.36956 120 110 0 0.0768654
68 44 3 184 0 1 1 1 1 1 142 120 0 0.077039
269 55 1 248 1 1 2 1 1 1 96 145 1 0.0784453
222 63 1 223 0 1 0 1 1 1.19586 115 150 1 0.0786586
144 54 3 230 0 1 0 1 0 1.48177 130 150 0 0.0788217
204 47 1 226 1 1 1.5 1 1 1 98 150 1 0.078888
262 52 1 404 1 1 2 1 1 1 124 140 1 0.0790452
46 41 3 184 0 1 0 1 0 1.47235 180 125 0 0.0791359
73 45 1 297 0 1 0 1 0 1.32829 144 132 0 0.0798444
238 52 1 273.523 1 1 1.5 1 1 1 126 170 1 0.0799551
58 42 1 358 0 1 0 1 1 1.3385 170 140 0 0.0804819
114 50 1 215 1 1 0 1 1 1.23782 140 150 0 0.0805382
192 37 1 207 1 1 1.5 1 1 1 130 140 1 0.0807155
169 56 1 85 0 1 0 1 1 1.39164 140 120 0 0.0808652
203 47 2 193 1 1 1 1 1 1 145 140 1 0.0824245
207 48 1 268 1 1 1 1 1 1 103 160 1 0.0824746
8 33 2 298 0 1 0 1 1 1.36098 185 120 0 0.0830886
6 32 3 225 0 1 0 1 1 1.40973 184 110 0 0.0831834
156 55 3 320 0 1 0 1 0 1.46382 155 122 0 0.0851178
87 47 0 249 0 1 0 1 1 1.27185 150 110 0 0.0854966
63 43 3 266 0 1 0 1 0 1.38138 118 120 0 0.0855935
11 34 3 220 0 1 0 1 1 1.3632 150 98 0 0.0861795
199 43 1 247 1 1 2 1 1 1 130 150 1 0.0869482
230 46 1 231 1 1 0 1 1 0.954839 115 120 1 0.0870761
200 46 1 202 1 1 0 1 1 0.991815 150 110 1 0.0879442
132 53 2 274 0 1 0 1 0 1.38323 130 120 0 0.0884486
93 48 2 195 0 1 0 1 0 1.37463 125 120 0 0.0886647
81 46 2 163 0 0.995578 0 1 1 1.39172 116 150 0 0.0887371
88 47 3 263 0 1 0 1 1 1.50662 174 160 0 0.0891336
225 38 1 258.901 1 1 1 1 1 1 150 110 1 0.0905199
166 56 3 184 0 1 0 1 1 1.43349 100 130 0 0.090794
16 36 3 166 0 1 0 1 1 1.44486 180 120 0 0.0917647
104 49 3 253 0 1 0 1 1 1.44605 174 100 0 0.0918137
128 52 3 196 0 1 0 1 1 1.52954 165 160 0 0.0920144
239 53 1 246 1 1 0 1 1 0.980103 116 120 1 0.0920738
124 52 2 272 0 1 0 1 0 1.39657 139 125 0 0.0924291
249 58 1 263 1 1 2 1 1 1 140 130 1 0.0925944
164 55 1 229 1 1 0.5 1 1 1 110 140 0 0.0933395
92 48 3 284 0 1 0 1 0 1.39942 120 120 0 0.0933896
176 58 1 222 0 1 0 1 1 1.3391 100 135 0 0.0943583
279 43 1 288 1 1 2 1 1 1 135 140 1 0.0953903
19 36 2 160 0 1 0 1 1 1.42173 172 150 0 0.0962968
121 51 1 179 0 1 0 1 1 1.31518 100 130 0 0.0964982
160 55 3 326 0 1 0 1 1 1.48357 155 145 0 0.0965331
20 37 3 260 0 1 0 1 0 1.37387 130 120 0 0.0966762
174 58 3 251 0 1 0 1 1 1.43906 110 130 0 0.0969323
243 54 1 274.224 1 1 0 1 1 1.00388 118 140 1 0.0969578
44 40 2 233.377 0 1 0 1 1 1.42926 188 140 0 0.0970642
36 39 2 339 0 1 0 1 1 1.35734 170 120 0 0.097233
257 49 1 341 1 1 1 1 1 1 120 130 1 0.0981915
102 49 3 201 0 1 0 1 0 1.47926 164 124 0 0.0998013
274 59 1 263.489 0 1 0 1 1 1.16023 125 130 1 0.100086
123 52 3 245.244 0 1 0 1 0 1.47111 140 140 0 0.100332
21 37 2 211 0 1 0 1 0 1.36075 142 130 0 0.101185
201 46 1 186 0 1 0 1 1 1.1109 124 118 1 0.101386
261 52 1 298 1 1 1 1 1 1 110 130 1 0.101688
26 37 1 315 0 1 0 1 1 1.30192 158 130 0 0.1017
202 46 1 277 1 1 1 1 1 1 125 120 1 0.101707
190 35 3 257 0 1 0 1 1 1.16276 140 110 1 0.102072
60 43 3 201 0 1 0 1 0 1.4524 165 120 0 0.10245
245 55 1 268 1 1 1.5 1 1 1 128 140 1 0.102462
71 45 3 237 0 1 0 1 0 1.47029 170 130 0 0.102634
236 50 1 233 1 1 2 1 1 1 121 130 1 0.102873
113 50 1 129 0 1 0 1 1 1.37627 135 140 0 0.103391
29 38 2 292 0 1 0 1 1 1.34433 130 145 0 0.103463
101 49 3 237.575 0 1 0 1 0 1.4501 160 110 0 0.103521
214 52 1 225 1 1 2 1 1 1 120 130 1 0.103535
209 49 2 265 0 1 0 1 1 1.21401 175 115 1 0.103773
193 38 1 196 0 1 0 1 1 1.1192 166 110 1 0.103921
57 42 2 147 0 1 0 1 1 1.42807 146 160 0 0.103971
2 29 3 234.165 0 1 0 1 1 1.41433 170 140 0 0.103982
134 53 3 320 0 1 0 1 1 1.47969 162 140 0 0.104802
197 41 1 289 0 1 0 1 1 1.1158 170 110 1 0.104835
38 39 1 273 0 1 0 1 1 1.26364 132 110 0 0.104857
175 58 2 179 0 1 0 1 1 1.47702 160 140 0 0.104863
108 50 3 202 0 1 0 1 0 1.44341 145 110 0 0.104991
122 52 3 210 0 1 0 1 0 1.46488 148 120 0 0.105036
15 35 3 264 0 1 0 1 1 1.44063 168 150 0 0.105069
287 52 1 266 1 1 2 1 1 1 134 140 1 0.10621
178 59 3 287 0 1 0 1 1 1.49556 150 140 0 0.106245
39 39 1 307 0 1 0 1 1 1.28957 140 130 0 0.106678
194 38 1 282 0 1 0 1 1 1.11737 170 120 1 0.107229
110 50 3 168 0 1 0 1 1 1.47467 160 120 0 0.107247
149 54 3 246 0 1 0 1 1 1.4128 110 120 0 0.108111
258 49 1 234 1 1 1 1 1 1 140 140 1 0.108172
1 29 3 243 0 1 0 1 1 1.37679 160 120 0 0.108396
231 46 1 222 0 1 0 1 1 1.1027 112 130 1 0.108427
47 41 3 245 0 1 0 1 0 1.42871 150 130 0 0.108429
18 36 2 209 0 1 0 1 1 1.39501 178 130 0 0.108539
226 39 1 280 0 1 0 1 1 1.08565 150 110 1 0.10873
34 39 3 240.837 0 1 0 1 1 1.37937 120 130 0 0.110684
75 45 2 243.4 0 1 0 1 1 1.34597 110 135 0 0.110873
105 49 2 187 0 1 0 1 1 1.45483 172 140 0 0.11151
153 54 2 245.877 0 1 0 1 1 1.4127 122 150 0 0.111691
126 52 3 284 0 1 0 1 1 1.40657 118 120 0 0.111722
41 40 3 289 0 1 0 1 1 1.44785 172 140 0 0.112222
181 59 1 242.428 0 1 0 1 1 1.39307 140 140 0 0.112398
167 56 2 244.212 0 1 0 1 1 1.38763 114 130 0 0.112951
111 50 3 216 0 1 0 1 1 1.50003 170 140 0 0.112987
55 42 3 268 0 1 0 1 1 1.42817 136 150 0 0.114312
99 48 3 238 0 1 0 1 1 1.42436 118 140 0 0.114949
51 41 1 250 0 1 0 1 1 1.29086 142 112 0 0.115035
210 49 1 206 0 1 0 1 1 1.18827 170 130 1 0.115295
25 37 1 223 0 1 0 1 1 1.32205 168 120 0 0.115366
173 58 3 230 0 1 0 1 1 1.49214 150 130 0 0.11884
49 41 3 295 0 1 0 1 1 1.42452 170 120 0 0.119412
212 50 1 264 0 1 0 1 1 1.17306 150 145 1 0.120193
159 55 3 196 0 1 0 1 1 1.4995 150 140 0 0.120452
237 52 1 182 0 1 0 1 1 1.16905 150 120 1 0.120626
215 54 1 216 0 1 0 1 1 1.16328 140 125 1 0.121607
66 43 3 207 0 1 0 1 1 1.43818 138 142 0 0.121936
161 55 2 277 0 1 0 1 1 1.40906 160 110 0 0.122253
28 38 3 297 0 1 0 1 1 1.41162 150 140 0 0.123202
7 32 3 254 0 1 0 1 1 1.38592 155 125 0 0.124091
163 55 1 270 0 1 0 1 1 1.34807 140 120 0 0.124804
43 40 2 281 0 1 0 1 1 1.38178 167 130 0 0.125651
147 54 3 208 0 1 0 1 1 1.44806 142 110 0 0.126332
107 49 1 241.799 0 1 0 1 1 1.34211 130 140 0 0.126964
138 53 1 243 0 1 0 1 1 1.3878 155 140 0 0.127054
148 54 3 238 0 1 0 1 1 1.46795 154 120 0 0.127242
100 48 2 211 0 1 0 1 1 1.36923 138 110 0 0.127324
137 53 1 182 0 1 0 1 1 1.38062 148 130 0 0.128076
50 41 3 269 0 1 0 1 1 1.4044 144 125 0 0.128387
133 53 3 240.445 0 1 0 1 1 1.43681 132 120 0 0.12858
42 40 2 215 0 1 0 1 1 1.36072 138 130 0 0.128797
119 51 3 188 0 1 0 1 1 1.46193 145 125 0 0.129163
54 42 3 198 0 1 0 1 1 1.431 155 120 0 0.130047
135 53 2 195 0 1 0 1 1 1.40632 140 120 0 0.130304
77 45 1 224 0 1 0 1 1 1.34735 144 140 0 0.130991
33 39 3 204 0 1 0 1 1 1.40589 145 120 0 0.131109
24 37 2 194 0 1 0 1 1 1.36811 150 130 0 0.13112
40 40 3 275 0 1 0 1 1 1.41238 150 130 0 0.131227
152 54 2 217 0 1 0 1 1 1.40185 137 120 0 0.131401
76 45 1 225 0 1 0 1 1 1.31877 140 120 0 0.131506
120 51 3 224 0 1 0 1 1 1.46616 150 130 0 0.131625
69 44 3 215 0 1 0 1 1 1.42261 135 130 0 0.131691
53 42 3 196 0 1 0 1 1 1.42536 150 120 0 0.132573
98 48 3 245 0 1 0 1 1 1.46212 160 130 0 0.133118
80 46 2 230 0 1 0 1 1 1.38369 150 120 0 0.136404
[Images: outlier visualizations (scatter plot, parallel coordinates) — _images/Example_autoclean_3_40.png, _images/Example_autoclean_3_41.png]

Drop Outliers ...

Do you want to drop outliers? [y/n]y
Outliers are dropped.

Example_tasks

[2]:
# import datacleanbot and openml
import datacleanbot.dataclean as dc
import openml as oml
import numpy as np
Preparation: Acquire Data

The first step is to acquire data from OpenML. The dataset ID can be found in the address of the dataset's OpenML page.

[3]:
# acquire dataset with dataset ID 4
data = oml.datasets.get_dataset(4)
X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
Xy = np.concatenate((X,y.reshape((y.shape[0],1))), axis=1)
Task 1: Show Important Features
[4]:
dc.show_important_features(X, y, data.name, features)

Important Features

[Image: feature importance bar chart — _images/Example_tasks_5_1.png]
Task 2: Unify Column Names
[5]:
features = dc.unify_name_consistency(features)

Inconsistent Column Names


Column names
============
['duration', 'wage-increase-first-year', 'wage-increase-second-year', 'wage-increase-third-year', 'cost-of-living-adjustment', 'working-hours', 'pension', 'standby-pay', 'shift-differential', 'education-allowance', 'statutory-holidays', 'vacation', 'longterm-disability-assistance', 'contribution-to-dental-plan', 'bereavement-assistance', 'contribution-to-health-plan']

Column names are consistent
Task 3: Show Statistical Information
[6]:
dc.show_statistical_info(Xy)

Statistical Information

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
count 30.000000 37.000000 37.000000 37.000000 56.000000 22.000000 28.000000 27.000000 31.000000 9.000000 53.000000 51.000000 56.000000 46.000000 15.000000 51.000000 57.000000
mean 0.100000 1.108108 1.324324 0.594595 2.160714 0.545455 0.285714 1.037037 4.870968 7.444444 11.094340 0.960784 3.803571 3.971739 3.913333 38.039216 0.649123
std 0.305129 0.774015 0.818333 0.797895 0.707795 0.509647 0.460044 0.939782 4.544168 5.027701 1.259795 0.823669 1.370596 1.164028 1.304315 2.505680 0.481487
min 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 2.000000 9.000000 0.000000 2.000000 2.000000 2.000000 27.000000 0.000000
25% 0.000000 1.000000 1.000000 0.000000 2.000000 0.000000 0.000000 0.000000 3.000000 2.000000 10.000000 0.000000 2.500000 3.000000 2.400000 37.000000 0.000000
50% 0.000000 1.000000 2.000000 0.000000 2.000000 1.000000 0.000000 1.000000 4.000000 8.000000 11.000000 1.000000 4.000000 4.000000 4.600000 38.000000 1.000000
75% 0.000000 2.000000 2.000000 1.000000 3.000000 1.000000 1.000000 2.000000 5.000000 12.000000 12.000000 2.000000 4.500000 4.500000 5.000000 40.000000 1.000000
max 1.000000 2.000000 2.000000 2.000000 3.000000 1.000000 1.000000 2.000000 25.000000 14.000000 15.000000 2.000000 7.000000 7.000000 5.100000 40.000000 1.000000
Task 4: Discover Data Types
[7]:
# input can be Xy or X
dc.discover_types(Xy)

Discover Data Types

Simple Data Types

['float64', 'int64', 'float64', 'int64', 'int64', 'int64', 'float64', 'int64', 'int64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'int64', 'bool']

Statistical Data Types

['Type.POSITIVE', 'Type.REAL', 'Type.REAL', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.REAL', 'Type.POSITIVE', 'Type.POSITIVE', 'Type.REAL', 'Type.POSITIVE', 'Type.COUNT']
Task 5: Clean Duplicated Rows
[8]:
Xy = dc.clean_duplicated_rows(Xy)

Duplicated Rows

Identifying Duplicated Rows ...

No duplicated rows detected.

Task 6: Handle Missing Values
[9]:
features, Xy = dc.handle_missing(features, Xy)

Missing values

Identify Missing Data ...

The default setting of missing characters is ['n/a', 'na', '--', '?']
Do you want to add extra character? [y/n]n

Missing values detected!

Number of missing in each feature
0     27
1     20
2     20
3     20
4      1
5     35
6     29
7     30
8     26
9     48
10     4
11     6
12     1
13    11
14    42
15     6
16     0
dtype: int64

Records containing missing values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0.0 NaN NaN NaN 1.0 NaN NaN NaN 2.0 NaN 11.0 1.0 5.0 NaN NaN 40.0 1.0
1 NaN 2.0 2.0 NaN 2.0 0.0 NaN 1.0 NaN NaN 11.0 0.0 4.5 5.8 NaN 35.0 1.0
2 0.0 1.0 1.0 NaN NaN NaN 0.0 2.0 5.0 NaN 11.0 2.0 NaN NaN NaN 38.0 1.0
3 0.0 NaN NaN 2.0 3.0 0.0 NaN NaN NaN NaN NaN NaN 3.7 4.0 5.0 NaN 1.0
4 0.0 1.0 1.0 NaN 3.0 NaN NaN NaN NaN NaN 12.0 1.0 4.5 4.5 5.0 40.0 1.0

Missing correlation between features containing missing values and other features
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1.000000 0.112373 0.259620 0.185996 -0.126773 0.030389 0.018496 0.196296 -0.304456 -0.167360 0.014479 0.132569 -0.126773 0.070290 0.327569 -0.096414
1 0.112373 1.000000 0.460811 0.152703 -0.098247 0.280850 0.575362 0.182121 0.064742 -0.084895 0.085841 -0.012609 -0.098247 0.013074 -0.061512 -0.132393
2 0.259620 0.460811 1.000000 0.229730 -0.098247 0.129827 0.428296 0.182121 0.064742 0.015918 0.229751 0.346743 -0.098247 0.106224 0.105450 0.107175
3 0.185996 0.152703 0.229730 1.000000 0.181757 0.280850 0.354763 0.255745 -0.082870 0.116731 -0.201979 -0.012609 0.181757 0.013074 0.021969 -0.132393
4 -0.126773 -0.098247 -0.098247 0.181757 1.000000 0.105946 -0.135996 -0.140859 -0.122380 0.057864 -0.036711 -0.045835 1.000000 0.273268 0.079860 -0.045835
5 0.030389 0.280850 0.129827 0.280850 0.105946 1.000000 0.374342 0.113961 0.002539 0.150845 -0.064352 0.037082 0.105946 -0.160206 -0.146448 -0.197772
6 0.018496 0.575362 0.428296 0.354763 -0.135996 0.374342 1.000000 0.332923 0.195304 0.248198 -0.004820 -0.006018 -0.135996 -0.141967 -0.268444 -0.234718
7 0.196296 0.182121 0.182121 0.255745 -0.140859 0.113961 0.332923 1.000000 -0.259902 0.071001 0.123072 0.210905 -0.140859 -0.159324 -0.167984 -0.018078
8 -0.304456 0.064742 0.064742 -0.082870 -0.122380 0.002539 0.195304 -0.259902 1.000000 0.396558 0.299976 0.259754 -0.122380 -0.269331 -0.172611 0.374528
9 -0.167360 -0.084895 0.015918 0.116731 0.057864 0.150845 0.248198 0.071001 0.396558 1.000000 0.118958 0.148522 0.057864 -0.275913 -0.149514 0.148522
10 0.014479 0.085841 0.229751 -0.201979 -0.036711 -0.064352 -0.004820 0.123072 0.299976 0.118958 1.000000 0.800943 -0.036711 -0.134341 -0.147760 0.353357
11 0.132569 -0.012609 0.346743 -0.012609 -0.045835 0.037082 -0.006018 0.210905 0.259754 0.148522 0.800943 1.000000 -0.045835 -0.167729 -0.054661 0.441176
12 -0.126773 -0.098247 -0.098247 0.181757 1.000000 0.105946 -0.135996 -0.140859 -0.122380 0.057864 -0.036711 -0.045835 1.000000 0.273268 0.079860 -0.045835
13 0.070290 0.013074 0.106224 0.013074 0.273268 -0.160206 -0.141967 -0.159324 -0.269331 -0.275913 -0.134341 -0.167729 0.273268 1.000000 0.292239 -0.022872
14 0.327569 -0.061512 0.105450 0.021969 0.079860 -0.146448 -0.268444 -0.167984 -0.172611 -0.149514 -0.147760 -0.054661 0.079860 0.292239 1.000000 -0.054661
15 -0.096414 -0.132393 0.107175 -0.132393 -0.045835 -0.197772 -0.234718 -0.018078 0.374528 0.148522 0.353357 0.441176 -0.045835 -0.022872 -0.054661 1.000000
Missing mechanism is probably missing at random

Visualize Missing Data ...


[Images: missing data visualizations (bar chart, matrix, heatmap) — _images/Example_tasks_15_12.png, _images/Example_tasks_15_13.png, _images/Example_tasks_15_14.png]

Clean Missing Data ...


Choose the missing mechanism [a/b/c/d]:
a.MCAR b.MAR c.MNAR d.Skip
a
Missing percentage is 0.9824561403508771
Imputation score of mean is 0.8515151515151516
Imputation score of mode is 0.8674242424242424
Imputation score of knn is 0.9299242424242424
Imputation score of matrix factorization is 0.9299242424242424
Imputation score of multiple imputation is 0.9291666666666667
Imputation method with the highest score is knn

Recommended Approach!
The recommended approach is knn
Do you want to apply the recommended approach? [y/n]n


Choose the approach you want to apply [a/b/c/d/e/skip]:
a.Mean b.Mode c.K Nearest Neighbor d.Matrix Factorization e. Multiple Imputation
a

Applying mean imputation ...
Missing values cleaned!
Task 7: Handle Outliers
[10]:
Xy = dc.handle_outlier(features, Xy)

Outliers

Recommend Algorithm ...

The recommended approach is isolation forest.
Do you want to apply the recommended outlier detection approach? [y/n]y

Visualize Outliers ...

[Image: outlier visualization — _images/Example_tasks_17_4.png]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 anomaly_score
36 1 0 0 2 1 1 1 1 0 4 11 2 2 3.97174 3.91333 40 0 -0.0883411
34 0 1 2 2 3 1 1 0 1 2 10 0 2 2.5 2.1 40 0 -0.0724028
40 1 0 0 0 1 0 1 0 4.87097 7.44444 11 1 4 3.97174 3.91333 38.0392 0 -0.0318737
18 1 0 0 0 1 0 1 0 4.87097 7.44444 11 1 2 3.97174 3.91333 38 0 -0.0313807
56 0 2 2 0.594595 3 0.545455 0 1.03704 14 7.44444 9 2 6 6 4 35 1 -0.0218354
8 0 1 1.32432 0.594595 2 0 0 1.03704 25 12 11 0 3 7 3.91333 38 1 -0.0199344
17 0.1 1 0 2 1 1 0 1 3 2 9 0 2.1 3.97174 3.91333 40 0 -0.0124747
31 0 1 2 2 3 1 0 0 5 7.44444 10 0 3 2 2.5 40 0 -0.00911494
6 0 0 1 2 3 0.545455 0 2 4.87097 7.44444 12 2 4 5 5 38.0392 1 -0.00885507
37 0.1 1 0 0 1 1 0 2 3 2 9 0 2.8 3.97174 3.91333 38 0 -0.00362979
38 0 1.10811 0 0.594595 3 0.545455 0.285714 2 4.87097 7.44444 10 1 2 2.5 2 37 0 0.00190097
25 0 1 2 0 3 0.545455 0.285714 0 4.87097 7.44444 10 0 2 2 2 40 0 0.00266917
33 0.1 0 0 0 2 1 1 0 3 7.44444 10 0 4 5 3.91333 40 0 0.00353423
41 0 0 2 0 2 0 0 2 4.87097 7.44444 12 2 2 3 3.91333 38 0 0.0133322
35 0 0 2 0 2 1 0 0 4.87097 7.44444 11 1 2 2 3.91333 40 0 0.0134714
19 0.1 1.10811 1.32432 1 2 0.545455 0.285714 1.03704 5 13 15 2 4 5 3.91333 35 1 0.0161718
7 0.1 1.10811 1.32432 0.594595 3 0.545455 0.285714 1.03704 3 7.44444 12 0 6.9 4.8 2.3 40 1 0.0208748
11 0.1 2 1.32432 0.594595 2 0.545455 0.285714 1.03704 4 7.44444 15 0.960784 6.4 6.4 3.91333 38 1 0.0243395
14 0.1 1.10811 1.32432 0 1 1 0.285714 1.03704 10 7.44444 11 2 3 3.97174 3.91333 36 1 0.0247281
9 0.1 2 1.32432 0 1 0.545455 0 2 4 7.44444 11 2 5.7 3.97174 3.91333 40 1 0.0269993
44 0.1 0 0 0 2 0.545455 1 0 3 7.44444 10 0 4 4 3.91333 40 0 0.0270557
27 0 1.10811 2 0 2 0 0.285714 1.03704 4.87097 7.44444 12 2 3 3 3.91333 33 1 0.0310571
13 0 2 2 1 3 0.545455 0.285714 1.03704 4 7.44444 13 2 3.5 4 5.1 37 1 0.0398261
26 0.1 0 1 1 2 0 0 1.03704 4.87097 7.44444 10 0 4.5 4.5 3.91333 38.0392 1 0.0399781
29 0 1.10811 2 0.594595 3 0.545455 0.285714 0 4.87097 7.44444 10 1 2 2.5 3.91333 35 0 0.0402457
53 0.1 2 2 0 3 0.545455 0 2 6 7.44444 11 1 4 3.5 3.91333 40 1 0.0404225
42 0 1.10811 1.32432 2 2 0.545455 0.285714 2 4.87097 7.44444 12 1 2.5 2.5 3.91333 39 0 0.0435786
54 0.1 1.10811 2 0 3 0.545455 0 2 6 10 11 2 5 4.4 3.91333 38 1 0.0459069
1 0.1 2 2 0.594595 2 0 0.285714 1 4.87097 7.44444 11 0 4.5 5.8 3.91333 35 1 0.0467527
24 0.1 1.10811 1.32432 0.594595 1 0.545455 0.285714 1.03704 3 8 9 2 6 3.97174 3.91333 38 1 0.0475632
51 0 1 1.32432 1 3 0 0 2 4.87097 7.44444 11.0943 0.960784 2 3 3.91333 38.0392 1 0.0477415
28 0 2 2 0 2 1 0 1.03704 5 7.44444 11 0 5 4 3.91333 37 1 0.0485515
22 0.1 1.10811 1.32432 1 3 0.545455 0.285714 1.03704 4.87097 7.44444 11.0943 0.960784 3.5 4 4.6 27 1 0.0497246
45 0.1 1 1 0.594595 2 1 1 1.03704 2 7.44444 10 0 4.5 4 3.91333 40 0 0.0502415
3 0 1.10811 1.32432 2 3 0 0.285714 1.03704 4.87097 7.44444 11.0943 0.960784 3.7 4 5 38.0392 1 0.0507025
5 0.1 1.10811 1.32432 0.594595 2 0 0.285714 1.03704 6 7.44444 12 1 2 2.5 3.91333 35 1 0.0529418
12 0.1 1 1 0 2 1 1 1.03704 2 7.44444 10 0 3.5 4 3.91333 40 0 0.0536357
10 0 1.10811 2 0 3 0.545455 0.285714 1.03704 3 7.44444 13 2 3.5 4 4.6 36 1 0.0586344
52 0 1.10811 2 1 3 0.545455 0.285714 1.03704 4.87097 7.44444 13 2 3.5 4 4.5 35 1 0.0599152
49 0 2 2 0 2 0.545455 0 1 4.87097 7.44444 11 1 5.7 4.5 3.91333 40 1 0.0640585
16 0.1 1.10811 1.32432 0.594595 1 0.545455 0.285714 1.03704 2 7.44444 12 0 2.8 3.97174 3.91333 35 1 0.0649904
48 0.1 1.10811 2 0 2 0.545455 0 1.03704 5 14 11 0 5 4.5 3.91333 38 1 0.0669109
2 0 1 1 0.594595 2.16071 0.545455 0 2 5 7.44444 11 2 3.80357 3.97174 3.91333 38 1 0.067712
43 0 1.10811 1.32432 1 2 0.545455 0.285714 0 4.87097 7.44444 11 0 2.5 3 3.91333 40 0 0.0685044
15 0 2 1.32432 0 2 0.545455 0.285714 2 4.87097 7.44444 11 1 4.5 4 3.91333 37 1 0.0695864
50 0.1 2 1.32432 0.594595 2 0.545455 0 1.03704 4.87097 7.44444 11 0.960784 7 5.3 3.91333 38.0392 1 0.0718371
39 0 2 1 0 2 0.545455 0 1.03704 4 7.44444 12 1 4.5 4 3.91333 40 1 0.0732847
55 0 1 1 0.594595 3 0.545455 0.285714 1.03704 4.87097 7.44444 12 1 5 5 5 40 1 0.0735484
21 0.1 1.10811 1.32432 0.594595 2 0.545455 0.285714 0 4.87097 7.44444 11 0 2.5 3 3.91333 40 0 0.0737297
30 0.1 1 1.32432 0 3 1 0.285714 1.03704 4.87097 7.44444 11 1 4.5 4.5 5 40 1 0.0739734
32 0.1 1.10811 1.32432 0.594595 2 0.545455 0.285714 2 4.87097 7.44444 10 1 2.5 2.5 3.91333 38 0 0.0760219
20 0.1 2 2 0.594595 2 0.545455 0.285714 1.03704 4 7.44444 12 2 4.3 4.4 3.91333 38 1 0.0796261
4 0 1 1 0.594595 3 0.545455 0.285714 1.03704 4.87097 7.44444 12 1 4.5 4.5 5 40 1 0.0797176
0 0 1.10811 1.32432 0.594595 1 0.545455 0.285714 1.03704 2 7.44444 11 1 5 3.97174 3.91333 40 1 0.0805313
46 0 2 2 0 2 0.545455 0.285714 1.03704 5 7.44444 11 1 4.5 4 3.91333 40 1 0.0913668
23 0.1 1 2 0.594595 2 0.545455 0.285714 1.03704 4 7.44444 10 2 4.5 4 3.91333 40 1 0.0932795
47 0.1 1 1 1 2 0.545455 0 1.03704 4.87097 7.44444 11.0943 0.960784 4.6 4.6 3.91333 38 1 0.101976
[Images: outlier visualizations (scatter plot, parallel coordinates) — _images/Example_tasks_17_6.png, _images/Example_tasks_17_7.png]

Drop Outliers ...

Do you want to drop outliers? [y/n]n
Outliers are kept.

API Reference

API

A data cleaning Python tool.

dataclean.autoclean(Xy, dataset_name, features)

Auto-cleans data.

The following aspects are automatically cleaned: show important features; show statistical information; discover the data type for each feature; identify the duplicated rows; unify the inconsistent column names; handle missing values; handle outliers.

Parameters:

Xy : array-like

Complete data.

dataset_name : string

Dataset name.

features : list

List of feature names.

Returns:

Xy_cleaned : array-like

Cleaned data.

dataclean.build_forest(X, y)

Build random forest model from the dataset and compute important features

Parameters:

X : array-like, shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)

Target values (class labels in classification, real numbers in regression).

Returns:

importances : array, shape = [n_features]

The feature importances (the higher, the more important the feature).

indices : array, shape = [n_features]

Indices of the features sorted by decreasing importance.

dataclean.clean_duplicated_rows(Xy)

Clean duplicated rows.

Parameters:

Xy : array-like

Complete numpy array (target required) of the dataset.

Returns:

Xy : array-like

Original data.

Xy_no_duplicate : array-like

Cleaned data without duplicated rows if user wants to drop the duplicated rows.

dataclean.clean_missing(df, features)

Clean missing values in the dataset.

Parameters:

df : DataFrame

features : List

List of feature names.

Returns:

features_new : List

List of feature names after cleaning.

Xy_filled : array-like

Numpy array where missing values have been cleaned.

dataclean.compute_clustering_metafeatures(X)

Computes clustering meta features.

The following 3 clustering meta features are adopted: Silhouette Coefficient; Calinski-Harabasz Index; Davies-Bouldin Index.
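
All three indices are available in scikit-learn. A hedged sketch of computing them, where the clustering used to obtain labels is an assumption of this example:

from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Illustrative clustering to obtain labels; datacleanbot's choice is not stated here
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))
print(calinski_harabasz_score(X, labels))
print(davies_bouldin_score(X, labels))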

dataclean.compute_imputation_score(Xy)

Computes score of the imputation by applying simple classifiers.

The following simple learners are evaluated: Naive Bayes Learner; Linear Discriminant Learner; One Nearest Neighbor Learner; Decision Node Learner.

Parameters:

Xy : array-like

Complete numpy array of the dataset. The training array X has to be imputed already, and the target y is required here and not optional in order to predict the performance of the imputation method.

Returns:

imputation_score : float

Predicted score of the imputation method.
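
A hedged sketch of what scoring an imputation with simple classifiers can look like; the cross-validation setup and exact learner settings are assumptions, not datacleanbot's code:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def imputation_score_sketch(Xy_imputed):
    # Split the imputed array into training vectors and target
    X_imp, y_imp = Xy_imputed[:, :-1], Xy_imputed[:, -1]
    learners = [GaussianNB(),                          # Naive Bayes
                LinearDiscriminantAnalysis(),          # Linear Discriminant
                KNeighborsClassifier(n_neighbors=1),   # One Nearest Neighbor
                DecisionTreeClassifier(max_depth=1)]   # Decision Node (stump)
    # Average cross-validated accuracy over the simple learners
    return np.mean([cross_val_score(clf, X_imp, y_imp, cv=3).mean()
                    for clf in learners])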

dataclean.compute_metafeatures(X, y)

Computes landmarking meta features.

The following landmarking features are computed: Naive Bayes Learner; Linear Discriminant Learner; One Nearest Neighbor Learner; Decision Node Learner; Randomly Chosen Node Learner.

dataclean.deal_mar(df)

Deal with missing data with a missing at random (MAR) pattern.

dataclean.deal_mcar(df)

Deal with missing data with a missing completely at random (MCAR) pattern.

dataclean.deal_mnar(df)

Deal with missing data with a missing not at random (MNAR) pattern.

dataclean.discover_type_heuristic(data)

Infer data types for each feature using simple logic.

Parameters:

data : numpy array or dataframe

Numeric data needs to be 64 bit.

Returns:

result : list

List of data types.

dataclean.discover_types(Xy)

Discover types for numpy array.

Both simple logic rules and Bayesian methods are applied. Bayesian methods can only be applied if Xy is numeric.

Parameters:

Xy : numpy array or DataFrame

Xy can only be numeric in order to run the Bayesian model.

dataclean.drop_duplicated_rows(dataframe)

Drop duplicated rows.

dataclean.drop_outliers(df, df_outliers)

Drops the detected outliers.

dataclean.handle_missing(features, Xy)

Handle missing values.

Recommends the appropriate approach to the user given the missing mechanism of the dataset. The user can choose to adopt the recommended approach or take another available approach.

For MCAR, the following methods are evaluated: ‘list deletion’, ‘mean’, ‘mode’, ‘k nearest neighbors’, ‘matrix factorization’, ‘multiple imputation’.

For MAR, the following methods are evaluated: ‘k nearest neighbors’, ‘matrix factorization’, ‘multiple imputation’.

For MNAR, ‘multiple imputation’ is adopted.

Parameters:

features : list

List of feature names.

Xy : array-like

Complete numpy array (target required and not optional).

Returns:

features_new : List

List of feature names after cleaning.

Xy_filled : array-like

Numpy array where missing values have been cleaned.

dataclean.handle_outlier(features, Xy)

Cleans the outliers.

Recommends an algorithm to the user for detecting the outliers and presents the detected outliers in effective visualizations. The user can decide whether or not to keep the outliers.

Parameters:

features : list

List of feature names.

Xy : array-like

Numpy array. Both training vectors and target are required.

Returns:

Xy_no_outliers : array-like

Cleaned data where outliers are dropped.

Xy : array-like

Original data where outliers are not found or kept.

dataclean.highlight_outlier(data)

Highlight the maximum in a Series yellow.

dataclean.identify_missing(df=None)

Detect missing values.

Identifies common missing characters such as ‘n/a’, ‘na’, ‘--’ and ‘?’ as missing. Users can also customize the characters to be identified as missing.

Parameters:

df : DataFrame

Raw data formatted in DataFrame.

Returns:

flag : bool

Indicates whether missing values are detected. If true, missing values are detected. Otherwise not.

dataclean.identify_missing_mechanism(df=None)

Tries to guess the missing mechanism of the dataset.

The missing mechanism is not really testable. There may be reasons to suspect that the dataset belongs to one missing mechanism based on the missing correlation between features, but the result is not definite. Relevant information is provided to help the user make the decision. Three missing mechanisms can be guessed: MCAR (missing completely at random), MAR (missing at random) and MNAR (missing not at random; not available here, as it normally involves a field expert).

Parameters:

df : DataFrame

Raw data formatted in DataFrame.
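
The “missing correlation between features” shown in the examples can be derived from the missingness indicator matrix. A minimal sketch, where df is assumed to be the raw DataFrame:

indicator = df.isnull().astype(int)            # 1 = missing, 0 = present
indicator = indicator.loc[:, indicator.any()]  # keep features with missing values
print(indicator.corr())                        # structure here hints at MAR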

dataclean.identify_outliers(df, algorithm=0, detailed=False)

Identifies outliers in multiple dimensions.

Dataset has to be parsed as numeric beforehand.

dataclean.infer_feature_type(feature)

Infer data types for the given feature using simple logic.

Possible data types to infer: boolean, date, float, integer, string. A feature that is not a boolean, a date, a float or an integer is classified as a string.

Parameters:

feature : array-like

A feature/attribute vector.

Returns:

data_type : string

The data type of the given feature/attribute.

dataclean.missing_preprocess(features, df=None)

Drops the redundant information.

Redundant information is dropped before imputation. Detects and drops empty rows. Detects features and instances with an extremely large proportion of missing data and reports them to the user.

Parameters:

features : list

List of feature names.

df : DataFrame

Returns:

df : DataFrame

New DataFrame where redundant information may have been deleted.

features_new: list

List of feature names after preprocessing.

dataclean.plot_feature_importances(dataset_name, features, importances, indices)

Plot the 15 most important features.

dataclean.predict_best_anomaly_algorithm(X, y)

Predicts best anomaly detection algorithm.

Recommends the best anomaly detection algorithm to the user given the characteristics of the dataset. The following algorithms are considered: 0: isolation forest; 1: local outlier factor; 2: one class support vector machine.
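
The three candidate detectors correspond to standard scikit-learn estimators. A sketch of running all three (the settings are assumptions, not datacleanbot's):

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

detectors = {0: IsolationForest(random_state=0),  # isolation forest
             1: LocalOutlierFactor(),             # local outlier factor
             2: OneClassSVM(gamma='auto')}        # one class SVM
for idx, det in detectors.items():
    labels = det.fit_predict(X)   # -1 marks detected outliers
    print(idx, (labels == -1).sum())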

dataclean.show_important_features(X, y, data_name, features)

Show the most important features of the given dataset.

Computes the most important features of the given dataset using a random forest, and presents the 15 most useful features to the user in a bar chart.

Parameters:

X : array-like, shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)

Target values (class labels in classification, real numbers in regression).

data_name : string

Dataset name.

features : list

List of feature names.

dataclean.show_statistical_info(Xy)

Show statistical information of the given dataset.

Parameters:

Xy : array-like

Complete data.

dataclean.train_metalearner()

Train the metalearner.

dataclean.unify_name_consistency(names)

Unify inconsistent column names.

Parameters:

names : list

List of original column names.

Returns:

names : list

Unified column names.

dataclean.visualize_missing(df=None)

Visualize missing values.

The missingness of the dataset is visualized in bar chart, matrix and heatmap.
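
Equivalent plots can be produced with the missingno package (whether datacleanbot uses it internally is not stated here); df is assumed to be the raw DataFrame:

import missingno as msno

msno.bar(df)      # bar chart of non-missing counts per feature
msno.matrix(df)   # matrix view of the missingness pattern
msno.heatmap(df)  # heatmap of missing correlation between features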

dataclean.visualize_outliers_parallel_coordinates(df_scaled, df_pred)

Visualizes high-dimensional outliers with a parallel coordinates plot.

dataclean.visualize_outliers_scatter(df, df_pred)

Visualizes high-dimensional outliers with a scatter plot.

Selects out the two features most likely to have outliers and shows them in a scatter plot.
