machine learning - Is it a good practice to reduce a dataset to have a better PCA decomposition?

March 15, 2010

While trying to work on the credit card fraud dataset on Kaggle (link), I found out that I can get a better model if I reduce the size of the training dataset. To explain: the dataset is composed of 284807 records with 31 features. In this dataset there are only 492 frauds (so 0.17%).
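To make the imbalance concrete, here is a small sketch of the arithmetic using only the two counts quoted above. The helper name `fraud_share` and the choice of 100000 kept non-fraud records are illustrative assumptions, not part of the original post.

```python
# Counts quoted in the question.
N_RECORDS = 284807
N_FRAUDS = 492
N_NON_FRAUDS = N_RECORDS - N_FRAUDS  # 284315


def fraud_share(n_non_frauds_kept):
    """Share of frauds when all frauds and only n_non_frauds_kept non-frauds are kept."""
    return N_FRAUDS / (N_FRAUDS + n_non_frauds_kept)


print(f"{fraud_share(N_NON_FRAUDS):.2%}")  # full dataset: 0.17%
print(f"{fraud_share(100000):.2%}")        # 100000 non-frauds kept: 0.49%
```

Dropping non-fraud records is the only thing that changes here; the 492 frauds are always kept, so their share grows as the sample shrinks.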

I've tried a PCA on the full dataset, keeping only the 3 most important dimensions to be able to display it. The result is the following one:

In this one, it's impossible to find a pattern to determine whether a record is a fraud or not.

If I reduce the number of non-fraud records only, to increase the ratio (fraud/non_fraud), this is what I get with the same plot:

Now, I don't know if it makes sense to fit a PCA on a reduced dataset in order to have a better decomposition. For example, if I use the PCA with 100000 points, we can say that all entries with pca1 > 5 are fraud.

This is the code I used, if you want to try it:

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    dataset = pd.read_csv("creditcard.csv")
    sample_size = 284807 - 492  # between 1 and 284807-492
    a = dataset[dataset["Class"] == 1]  # keep the frauds
    b = dataset[dataset["Class"] == 0].sample(sample_size)  # reduce the non-fraud quantity

    dataset = pd.concat([a, b]).sample(frac=1)  # concat and shuffle

    # scaling of the features for the PCA
    y = dataset["Class"]
    x = dataset.drop("Class", axis=1)
    x_scale = StandardScaler().fit_transform(x)

    # doing the PCA on the dataset
    pca = PCA(n_components=3)
    x_pca = pca.fit_transform(x_scale)

    pca1, pca2, pca3, c = x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], y
    plt.scatter(pca1, pca2, s=pca3, c=c)
    plt.xlabel("pca1")
    plt.ylabel("pca2")
    plt.title("{}-points".format(sample_size))
    # plt.savefig("{}-points".format(sample_size), dpi=600)
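One way to check whether a rule such as "pca1 > 5" found on the reduced data still holds on everything is to fit the scaler and the PCA on the reduced sample and then project all records with that same fitted model. The sketch below is only an illustration of that idea: the function name `fit_on_reduced_project_all`, its parameters and the synthetic data are assumptions, and the synthetic array stands in for creditcard.csv.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_on_reduced_project_all(X, y, n_majority, seed=0):
    """Fit scaler + PCA on all positives plus n_majority random negatives,
    then project every row of X with the fitted model."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    kept_majority = rng.choice(majority_idx, size=n_majority, replace=False)
    reduced_idx = np.concatenate([minority_idx, kept_majority])

    model = make_pipeline(StandardScaler(), PCA(n_components=3))
    model.fit(X[reduced_idx])          # learned on the reduced sample only
    return model, model.transform(X)   # applied to the full data


# Synthetic stand-in data (hypothetical): 2000 rows, 20 shifted "frauds".
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 6))
y_demo = np.zeros(2000, dtype=int)
y_demo[:20] = 1
X_demo[:20] += 4.0

model, projected = fit_on_reduced_project_all(X_demo, y_demo, n_majority=200)
print(projected.shape)  # (2000, 3)
```

With the real data, counting how many non-fraud rows in `projected` also exceed the threshold would show how many false positives the rule produces once the discarded records are brought back.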

Thanks for your help,

It makes sense, definitely.

The technique you are using is commonly known as random undersampling, and in ML it is useful in general when dealing with imbalanced data problems (such as the one you are describing). You can see more on this Wikipedia page.

There are, of course, many other methods to deal with class imbalance, but the beauty of this one is that it is quite simple and, sometimes, really effective.
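As a minimal sketch of random undersampling with pandas, assuming a binary label column where 1 marks the minority class: the function name `random_undersample`, the `majority_per_minority` parameter and the toy DataFrame are all hypothetical and only serve the illustration.

```python
import pandas as pd


def random_undersample(df, label_col="Class", majority_per_minority=1.0, seed=0):
    """Keep every minority row (label 1) and a random subset of majority rows (label 0)."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    # Number of majority rows to keep, capped at what is available.
    n_keep = min(len(majority), int(round(len(minority) * majority_per_minority)))
    kept = majority.sample(n=n_keep, random_state=seed)
    # Concatenate, shuffle, and rebuild a clean index.
    return pd.concat([minority, kept]).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (hypothetical): 10 positives among 1000 rows.
demo = pd.DataFrame({"Amount": range(1000), "Class": [1] * 10 + [0] * 990})
balanced = random_undersample(demo, majority_per_minority=5)
print(balanced["Class"].value_counts().to_dict())  # {0: 50, 1: 10}
```

Fixing the seed keeps the sample reproducible, which helps when comparing plots or models across different sample sizes.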
