machine learning - Is it a good practice to reduce a dataset to have a better PCA decomposition


While trying to work on the credit card fraud dataset on Kaggle (link), I found out that I can get a better model if I reduce the size of the training dataset. To explain: the dataset is composed of 284807 records of 31 features. In this dataset there are only 492 frauds (so 0.17%).

I've tried to do a PCA on the full dataset and keep only the 3 most important dimensions, to be able to display it. The result is the following one:

pca full dataset

In this one, it's impossible to find a pattern that determines whether an entry is a fraud or not.

If I reduce the number of non-fraud records to increase the ratio (fraud/non_fraud), this is what I get with the same plot:

pca_100000

pca_10000

pca_1000

Now, I don't know if it makes sense to fit a PCA on a reduced dataset in order to have a better decomposition. For example, if I use the PCA fitted on 100000 points, I could say that all entries with pca1 > 5 are fraud.

This is the code if you want to try it:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    dataset = pd.read_csv("creditcard.csv")
    sample_size = 284807 - 492  # between 1 and 284807-492

    a = dataset[dataset["Class"] == 1]  # keep all frauds
    b = dataset[dataset["Class"] == 0].sample(sample_size)  # reduce non-fraud quantity

    dataset = pd.concat([a, b]).sample(frac=1)  # concat and shuffle

    # Scaling of the features for the PCA
    y = dataset["Class"]
    x = dataset.drop("Class", axis=1)
    x_scale = StandardScaler().fit_transform(x)

    # Doing the PCA on the dataset
    pca = PCA(n_components=3)
    x_pca = pca.fit_transform(x_scale)

    pca1, pca2, pca3, c = x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], y
    plt.scatter(pca1, pca2, s=pca3, c=y)
    plt.xlabel("pca1")
    plt.ylabel("pca2")
    plt.title("{}-points".format(sample_size))
    # plt.savefig("{}-points".format(sample_size), dpi=600)

Thanks for your help,

It makes sense, definitely.

The technique you are using is commonly known as random undersampling, and in ML it is useful in general when dealing with imbalanced data problems (such as the one you are describing). You can see more on the Wikipedia page.

There are, of course, many other methods to deal with class imbalance, but the beauty of this one is that it is quite simple and, sometimes, really effective.
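As a rough illustration of random undersampling, here is a minimal sketch. It uses synthetic data rather than creditcard.csv, and the `undersample` helper, the feature names `f0`..`f4`, and the class sizes are all made up for the example. It also assumes that you want to fit the scaler and the PCA on the reduced set and then project the full set with that same fitted transform; that is one possible way to use the idea, not something the question prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def undersample(df, label_col, n_majority, seed=0):
    """Keep every minority row (label 1) plus a random subset of majority rows (label 0)."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0].sample(n=n_majority, random_state=seed)
    # Concatenate, then shuffle so the two classes are interleaved.
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)


# Synthetic stand-in for the real data (hypothetical values, not creditcard.csv).
rng = np.random.default_rng(0)
n_normal, n_fraud = 5000, 50
features = np.vstack([
    rng.normal(0.0, 1.0, size=(n_normal, 5)),
    rng.normal(3.0, 1.0, size=(n_fraud, 5)),
])
data = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
data["Class"] = [0] * n_normal + [1] * n_fraud

# Keep all 50 minority rows and only 500 of the 5000 majority rows.
reduced = undersample(data, "Class", n_majority=500)

# Fit the scaler and the PCA on the reduced set only...
x_reduced = reduced.drop("Class", axis=1)
scaler = StandardScaler().fit(x_reduced)
pca = PCA(n_components=3).fit(scaler.transform(x_reduced))

# ...then project the full dataset with the same fitted transform.
full_pca = pca.transform(scaler.transform(data.drop("Class", axis=1)))
```

Fixing `random_state` makes the subsample reproducible, which helps when comparing plots for different values of `n_majority`.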

