🎢 Imbalanced Data Analysis (ROSE package, ovun.sample() / ROSE())

🔵 Imbalanced Data Analysis (Classification Models)

In real-world work, it is said to be quite common for one class of the target variable to have very few observations. In my own case I do more time-series analysis than classification, so the difficulties I have run into are a little different, but because this situation is so common it can seriously disrupt an analysis. The reason is that a model built on data skewed toward one class (unbalanced data) ends up biased toward that class. In other words, it is a weighting problem.

This can be a serious issue in several fields, credit cards and medicine being the obvious ones. For example, about 2% of credit cards are said to be used fraudulently each year, and in disease screening the incidence of a rare disease may be as low as 0.4%. A model built on such data can conclude that nothing is wrong even when a card really was used fraudulently or a patient really does have the rare disease. This is clearly a different kind of difficulty from the ones I see with forecasting models. Several approaches exist to compensate for it, and that is what I want to talk about today.
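To make the problem concrete, here is a minimal base-R sketch. It assumes a hypothetical set of 10,000 transactions with the 2% fraud rate quoted above, and a hypothetical "classifier" that always predicts the majority class. The numbers are purely illustrative, not real data.

```r
# Hypothetical data: 10,000 transactions, 2% of them fraudulent (class 1)
n      <- 10000
actual <- factor(c(rep(1, 200), rep(0, 9800)), levels = c(0, 1))

# A naive "model" that always predicts the majority class (0)
predicted <- factor(rep(0, n), levels = c(0, 1))

# Accuracy looks excellent...
accuracy <- mean(predicted == actual)

# ...but recall on the minority class is zero: no fraud is ever caught
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

print(c(accuracy = accuracy, recall = recall))
```

This is why plain accuracy is a poor yardstick for imbalanced problems, and why the comparison later in this post relies on AUC instead.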

  1. Do nothing.
  2. Oversampling
  3. Undersampling
  4. Generate synthetic data by combining minority-class samples
  5. Fine-tune how the algorithm treats the classes.
    • Adjust the class weights
    • Adjust the cutoff (the threshold used to separate the classes)
    • Modify the algorithm so that it reacts more sensitively to minority-class samples

These are the options available.
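As a rough illustration of what options 2 and 3 do, here is a base-R sketch of random oversampling and undersampling done by hand. The toy data frame and all variable names are hypothetical, and this is only meant to show the idea, not how the ROSE package implements it internally.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: 95 majority rows (class 0) and 5 minority rows (class 1)
toy <- data.frame(
  x   = rnorm(100),
  cls = factor(c(rep(0, 95), rep(1, 5)))
)

maj_idx <- which(toy$cls == 0)
min_idx <- which(toy$cls == 1)

# Oversampling: draw minority rows WITH replacement until they match the majority
over_idx <- c(maj_idx, sample(min_idx, length(maj_idx), replace = TRUE))
toy_over <- toy[over_idx, ]

# Undersampling: keep only as many majority rows as there are minority rows
under_idx <- c(sample(maj_idx, length(min_idx)), min_idx)
toy_under <- toy[under_idx, ]

table(toy_over$cls)   # both classes now have 95 rows
table(toy_under$cls)  # both classes now have 5 rows
```

Oversampling keeps every original row but repeats minority rows, while undersampling throws most of the majority rows away. That is the basic trade-off between the two.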

Today, let's use the hacide data from the ROSE package, which is often used as an example for imbalanced data analysis.

install.packages("ROSE")
library(ROSE) 
data(hacide)
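Before resampling anything, it helps to look at how skewed the classes actually are. The sketch below assumes the ROSE package is installed and that data(hacide) creates the two data frames hacide.train and hacide.test, which is how they are used in the rest of this post.

```r
library(ROSE)
data(hacide)  # loads hacide.train and hacide.test

# Absolute and relative class frequencies in the training set
table(hacide.train$cls)
prop.table(table(hacide.train$cls))

# The same check for the test set
table(hacide.test$cls)
```

The minority class should come out as a very small share of the rows, which is what makes this data set a convenient example.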

The package provides a function called ovun.sample, which performs the sampling approaches mentioned above.

According to the official documentation, ovun.sample is used as follows.

Function      Description
ovun.sample   Over-sampling, under-sampling, combination of over- and under-sampling.

Usage

ovun.sample(formula, data, method="both", N,p=0.5,
            subset=options("subset")$subset,
            na.action=options("na.action")$na.action, seed)

formula: the variable to predict, given in the usual R formula form (for example, cls ~ .)

data: the data set to use

method: one of "over", "under", or "both". "over" oversamples the minority class, "under" undersamples the majority class, and "both" combines the two (both classes are resampled at random).

N: the desired size of the resulting data set. There are some caveats here, so please check the official documentation for this part.

The remaining arguments and options are not hard to use either, so please refer to the official documentation for them.
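One of those remaining arguments, p, is worth a quick sketch. As the package documentation describes it, p is the target proportion of the minority class, and when N is left out for "over" or "under" the resulting size is derived from p instead. The example below is a sketch under that reading of the documentation, and the object name under_p is just an illustrative choice.

```r
library(ROSE)
data(hacide)

# Undersample the majority class so that the minority class makes up
# roughly half of the result. N is not given, so the size follows from p.
under_p <- ovun.sample(cls ~ ., data = hacide.train,
                       method = "under", p = 0.5, seed = 1)$data

table(under_p$cls)  # class counts after undersampling
nrow(under_p)       # much smaller than the original training set
```

Specifying p is convenient when you care about the class ratio rather than an exact row count. Specifying N, as in the rest of this post, gives you direct control over the final size.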

https://cran.r-project.org/web/packages/ROSE/ROSE.pdf

The important point is that ovun.sample returns a list-like object, not a plain data frame.

oversampling  <- ovun.sample(cls ~ ., data = hacide.train, method = "over", N = 1960)

class(oversampling)
[1] "ovun.sample"

class(oversampling[1])
[1] "list"

So let's use $ to pull out just the data. That is the only part we are going to use anyway.

oversampling  <- ovun.sample(cls ~ ., data = hacide.train, method = "over", N = 1960)$data
class(oversampling)
[1] "data.frame"

As you can see, once $data is used, the class of the result is data.frame.

# OverSampling
oversampling  <- ovun.sample(cls ~ ., data = hacide.train, method = "over", N = 1960)$data
# UnderSampling
undersampling <- ovun.sample(cls ~ ., data = hacide.train, method = "under", N = 40, seed = 1)$data
# BothSampling
bothsampling <- ovun.sample(cls ~ ., data = hacide.train, method = "both", p = 0.5, N = 1000, seed = 1)$data
# ROSESampling
rose <- ROSE(cls ~ ., data = hacide.train)$data

# Tree models
library(rpart) # provides rpart(); it is not attached by library(ROSE)
raw   <- rpart(cls ~ ., data = hacide.train) # no resampling
over  <- rpart(cls ~ ., data = oversampling) # oversample 
under <- rpart(cls ~ ., data = undersampling) # undersample
both  <- rpart(cls ~ ., data = bothsampling) # both
rose  <- rpart(cls ~ ., data = rose) # ROSE sample (the name rose is reused for the fitted model)

# Predictions (class probabilities, one column per class)
pred_raw    <- predict(raw  , newdata = hacide.test)
pred_over   <- predict(over , newdata = hacide.test)
pred_under  <- predict(under, newdata = hacide.test)
pred_both   <- predict(both , newdata = hacide.test)
pred_rose   <- predict(rose , newdata = hacide.test)


# Area under the ROC curve (AUC): the closer to 1, the better the model
roc.curve(hacide.test$cls, pred_raw[,2], plot=FALSE) # 0.600
roc.curve(hacide.test$cls, pred_over[,2], plot=FALSE) # 0.798
roc.curve(hacide.test$cls, pred_under[,2], plot=FALSE) # 0.924
roc.curve(hacide.test$cls, pred_both[,2], plot=FALSE) # 0.798
roc.curve(hacide.test$cls, pred_rose[,2], plot=FALSE) # 0.985

Note that roc.curve comes from the ROSE package. It compares the actual labels in the test data with the predicted scores. Setting the plot argument (which R partially matches to roc.curve's plotit parameter) to TRUE draws the ROC curve.
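To get a feel for what the AUC number means, here is a small base-R sketch that computes it by hand using the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The function name auc_by_hand and the toy labels and scores are hypothetical, and this is not a claim about how roc.curve computes its value internally.

```r
# AUC via the rank-sum formulation; ties are handled by average ranks
auc_by_hand <- function(response, score, positive = 1) {
  is_pos <- response == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(score)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and scores, for illustration only
y     <- c(0, 0, 0, 1, 0, 1)
score <- c(0.1, 0.3, 0.2, 0.8, 0.7, 0.6)

auc_by_hand(y, score)  # 7 of the 8 positive/negative pairs are ordered correctly: 0.875
```

An AUC of 0.5 means the scores order positives and negatives no better than chance, and 1 means every positive is ranked above every negative.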


Don't forget ovun.sample() from the ROSE package. Pick "under", "over", or "both", or use the ROSE() function instead.