machine learning - Is it good practice to reduce a dataset to get a better PCA decomposition?
While working on the credit card fraud dataset on Kaggle (link), I found out that I can get a better model if I reduce the size of the training dataset. To explain: the dataset is composed of 284807 records with 31 features. In this dataset there are only 492 frauds (so only 0.17%).
I tried a PCA on the full dataset and kept only the 3 most important dimensions so that I could display it. The result is the following one:
In this one, it is impossible to find a pattern that determines whether a record is a fraud or not.
If I reduce the number of non-fraud records to increase the ratio (fraud/non_fraud), this is what I get with the same plot:
Now, I don't know if it makes sense to fit a PCA on a reduced dataset in order to get a better decomposition. For example, if I use the PCA fitted with 100000 points, I can say that the entries with pca1 > 5 are frauds.
This is the code if you want to try it:
```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv("creditcard.csv")

sample_size = 284807 - 492  # any value between 1 and 284807 - 492
a = dataset[dataset["Class"] == 1]  # keep all the frauds
b = dataset[dataset["Class"] == 0].sample(sample_size)  # reduce the non-fraud quantity
dataset = pd.concat([a, b]).sample(frac=1)  # concatenate and shuffle

# scale the features before the PCA
y = dataset["Class"]
X = dataset.drop("Class", axis=1)
X_scale = StandardScaler().fit_transform(X)

# do the PCA on the dataset
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scale)

# pca1 and pca2 are the axes, pca3 sets the marker size, the class sets the color
pca1, pca2, pca3 = X_pca[:, 0], X_pca[:, 1], X_pca[:, 2]
plt.scatter(pca1, pca2, s=pca3, c=y)
plt.xlabel("pca1")
plt.ylabel("pca2")
plt.title("{}-points".format(sample_size))
# plt.savefig("{}-points".format(sample_size), dpi=600)
plt.show()
```

Thanks for your help,
It makes sense, definitely.
The technique you are using is commonly known as random undersampling, and in ML it is useful in general when dealing with imbalanced data problems (such as the one you are describing). You can see more on the Wikipedia page.
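As a rough illustration of random undersampling, here is a minimal sketch. The helper name `random_undersample`, its parameters, and the toy data are assumptions made for this example; they are not part of the question's code or of any library.

```python
import numpy as np
import pandas as pd


def random_undersample(df, label_col, n_majority, seed=0):
    """Keep every minority row and a random subset of the majority rows.

    Hypothetical helper written for illustration only.
    """
    majority_label = df[label_col].value_counts().idxmax()
    minority = df[df[label_col] != majority_label]
    majority = df[df[label_col] == majority_label]
    n_keep = min(n_majority, len(majority))
    sampled = majority.sample(n=n_keep, random_state=seed)
    # Concatenate and shuffle so the classes are interleaved.
    return pd.concat([minority, sampled]).sample(frac=1, random_state=seed)


# Synthetic stand-in for an imbalanced dataset (not the Kaggle data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "Class": np.concatenate([np.ones(20, dtype=int), np.zeros(9_980, dtype=int)]),
})

reduced = random_undersample(toy, "Class", n_majority=200)
print(reduced["Class"].value_counts())
```

Fixing `seed` makes the subset reproducible, which helps when comparing plots made with different sample sizes.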
There are, of course, many other methods to deal with class imbalance, but the beauty of this one is that it is quite simple and, sometimes, really effective.
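One way to check whether a pattern seen on the reduced data carries over to all records is to fit the scaler and the PCA on the undersampled subset and then project the full dataset with them. The sketch below does this on synthetic data; the data, the threshold value, and the variable names are all assumptions chosen for illustration, not results from the Kaggle dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real records (assumption).
rng = np.random.default_rng(0)
n_normal, n_fraud, n_features = 5_000, 50, 5
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_normal, n_features)),
    rng.normal(3.0, 1.0, size=(n_fraud, n_features)),  # shifted minority class
])
y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)])

# Undersample: keep all frauds plus a random subset of the normal rows.
keep_normal = rng.choice(n_normal, size=200, replace=False)
subset = np.concatenate([keep_normal, np.arange(n_normal, n_normal + n_fraud)])

# Fit the scaler and the PCA on the reduced subset only.
scaler = StandardScaler().fit(X[subset])
pca = PCA(n_components=3).fit(scaler.transform(X[subset]))

# Project the FULL dataset with the transforms learned on the subset.
X_full_pca = pca.transform(scaler.transform(X))

# PCA component signs are arbitrary, so orient the first component
# toward the fraud class before applying a threshold rule.
pc1 = X_full_pca[:, 0]
if pc1[y == 1].mean() < pc1[y == 0].mean():
    pc1 = -pc1

threshold = 2.0  # illustrative value, not taken from the question
flagged = pc1 > threshold
true_pos = (flagged & (y == 1)).sum()
precision = true_pos / max(flagged.sum(), 1)
recall = true_pos / (y == 1).sum()
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Measuring precision and recall on the full data, rather than on the reduced subset, shows how many of the non-fraud records that were left out would also cross the threshold.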



