What are random forests?
Bagging + random feature sampling = random forest
If you build a bagging ensemble of decision trees but take it one step further by randomizing the features used to train each base learner, the result is called a random forest. In this reading, you’ll learn how random forests use this additional randomness to make better predictions, making them a powerful tool for the data professional.
Bagging (bootstrap aggregating) is a method of making effective predictions by combining many base learners.
For a more detailed explanation, see the link below:
https://kosonkh7.tistory.com/142
Adding randomization on top of bagging gives a random forest model, which is even more effective.
Why randomize?
Random forest models leverage randomness to reduce the likelihood that a given base learner will make the same mistakes (incorrect predictions) as the other base learners.
When the learners' mistakes are uncorrelated, randomization reduces both bias and variance.
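To make the variance half of this claim concrete, here is a small, illustrative NumPy simulation (not part of the original reading): it compares the spread of an averaged prediction error when the base learners' errors share a common component versus when they are independent.

```python
# Illustrative sketch (assumption: Gaussian errors) of why averaging helps
# more when base learners' errors are uncorrelated.
import numpy as np

rng = np.random.default_rng(42)
n_learners, n_trials = 100, 10_000

# Correlated case: every learner shares a common error component.
shared = rng.normal(size=(n_trials, 1))
correlated = np.sqrt(0.8) * shared + np.sqrt(0.2) * rng.normal(size=(n_trials, n_learners))

# Uncorrelated case: each learner makes independent mistakes.
uncorrelated = rng.normal(size=(n_trials, n_learners))

print("variance of a single learner's error:", round(uncorrelated[:, 0].var(), 3))
print("variance of the correlated average:  ", round(correlated.mean(axis=1).var(), 3))
print("variance of the uncorrelated average:", round(uncorrelated.mean(axis=1).var(), 3))
```

Averaging the uncorrelated errors shrinks the variance far more than averaging the correlated ones, which is why the randomization matters.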
In bagging, this randomization occurs by training each base learner on a sampling of the observations, with replacement.
Suppose you have a dataset with five observations: 1, 2, 3, 4, 5. If you draw a bootstrap sample of five observations from the original data, it might look like 1, 1, 3, 5, 5: some of the five observations are left out and some are included twice. As a result, each base learner trains on data that has been randomized by observation.
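A minimal NumPy sketch of this bootstrap step (not part of the original reading; the seed is arbitrary, so the drawn values will vary):

```python
# Draw a bootstrap sample (sampling with replacement) over the observations,
# mirroring the 1, 2, 3, 4, 5 example above.
import numpy as np

rng = np.random.default_rng(7)
observations = np.array([1, 2, 3, 4, 5])

# Same size as the original data, sampled with replacement.
bootstrap_sample = rng.choice(observations, size=len(observations), replace=True)
print(bootstrap_sample)  # some values repeat and some are missing, e.g. [1 1 3 5 5] as in the text
```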
Random forest goes further. It randomizes the data by features too.
This means that if there are five available features: A, B, C, D, and E, you can set the model to only sample from a subset of them.
In other words, each base learner will only have a limited number of features available to it, but what those features are will vary between learners.
The following example shows how this works when bootstrapping is combined with feature sampling.
The sample below contains five observations and four features from a larger dataset about cars.
| Model | Year | Kilometers | Price |
| --- | --- | --- | --- |
| Honda Civic | 2007 | 54,000 | $2,739 |
| Toyota Corolla | 2018 | 25,000 | $22,602 |
| Ford Fiesta | 2012 | 90,165 | $6,164 |
| Audi A4 | 2013 | 86,000 | $21,643 |
| BMW X5 | 2019 | 30,000 | $67,808 |
If you build a random forest model of three base learners, each trained on a bootstrap sample of three observations and two features, the following three samples might be generated:
Sample 1:

| Kilometers | Price |
| --- | --- |
| 54,000 | $2,739 |
| 54,000 | $2,739 |
| 90,165 | $6,164 |

Sample 2:

| Year | Kilometers |
| --- | --- |
| 2012 | 90,165 |
| 2013 | 86,000 |
| 2019 | 30,000 |

Sample 3:

| Model | Price |
| --- | --- |
| Honda Civic | $2,739 |
| Ford Fiesta | $6,164 |
| Ford Fiesta | $6,164 |
Notice what happened. Each sample contains three observations of just two features, and it’s possible that some of the observations may be repeated (because they’re sampled with replacement). A unique base learner would then be trained on each sample.
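The sketch below (an illustration under assumed settings, not the course's code) reproduces this double sampling with pandas and NumPy: rows are drawn with replacement, features without replacement, once per base learner.

```python
# Combine bootstrapping over observations with feature sampling,
# using the toy car data above. The seed and sample sizes are illustrative.
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "Model":      ["Honda Civic", "Toyota Corolla", "Ford Fiesta", "Audi A4", "BMW X5"],
    "Year":       [2007, 2018, 2012, 2013, 2019],
    "Kilometers": [54_000, 25_000, 90_165, 86_000, 30_000],
    "Price":      [2_739, 22_602, 6_164, 21_643, 67_808],
})

rng = np.random.default_rng(0)
n_learners, n_rows, n_features = 3, 3, 2

for i in range(n_learners):
    rows = rng.choice(cars.index, size=n_rows, replace=True)         # observations, with replacement
    cols = rng.choice(cars.columns, size=n_features, replace=False)  # features, without replacement
    print(f"Sample for base learner {i + 1}:\n{cars.loc[rows, cols]}\n")
```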
These are just toy datasets. In practice, you’ll have much more data, so there will be a lot more available to grow each base learner. But as you can imagine, randomizing the samples of both the observations and the features of a very large dataset allows for a near-infinite number of combinations, thus ensuring that no two training samples are identical.
How does all this sampling affect predictions?
The effect of all this sampling is that the base learners each see only a fraction of the possible data that’s available to them. Surely this would result in a model that’s not as good as one that was trained on the full dataset, right?
No! In fact, not only is it possible for model scores to improve with sampling, but the models also require significantly less time to run, since each tree is built from less data.
Here is a comparison of five different models, each trained and 5-fold cross-validated on the bank churn dataset from earlier in this course. The full training data had 7,500 observations and 10 features. Aside from the bootstrap sample size and number of features sampled, all other hyperparameters remained the same. The accuracy score is from each model’s performance on the test data.
| Model | Bootstrap sample size | Features sampled | Accuracy score | Runtime |
| --- | --- | --- | --- | --- |
| Bagging | 100% | 10 | 0.8596 | 15m 49s |
| Bagging | 30% | 10 | 0.8692 | 7m 41s |
| Random forest | 100% | 4 | 0.8704 | 8m 19s |
| Random forest | 30% | 4 | 0.8736 | 4m 53s |
| Random forest | 5% | 4 | 0.8652 | 3m 41s |
The bagging model with only 30% bootstrapped samples performed better than the one that used 100% samples, and the random forest model that used 30% bootstrapped samples and just 4 features performed better than all the others. Not only that, but runtime was cut by nearly 70% using the random forest model with 30% bootstrap samples.
It may seem counterintuitive, but you can often build a well-performing model with even smaller bootstrap samples. Take, for example, the above random forest model whose base learners were each built from just 5% samples of the training data. It was still able to achieve a 0.8652 accuracy score, not much worse than the champion model!
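If you want to run this kind of comparison yourself, here is a hedged scikit-learn sketch. The bank churn dataset isn't reproduced here, so a synthetic classification dataset stands in for it, the hyperparameters only loosely mirror two rows of the table, and it reports cross-validated accuracy rather than a held-out test score, so the numbers will differ.

```python
# Sketch: bagging vs. random forest with reduced bootstrap samples and
# feature sampling (synthetic stand-in data; scores will differ from the table).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=7_500, n_features=10, random_state=0)

models = {
    "Bagging, 100% samples, 10 features": BaggingClassifier(      # default base estimator is a decision tree
        n_estimators=100, max_samples=1.0, max_features=10, random_state=0),
    "Random forest, 30% samples, 4 features": RandomForestClassifier(
        n_estimators=100, max_samples=0.3, max_features=4, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean 5-fold CV accuracy = {scores.mean():.4f}")
```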
Key takeaways
Random forest builds on bagging, taking randomization even further by using only a fraction of the available features to train its base learners. This randomization from sampling often leads to both better performance scores and faster execution times, making random forest a powerful and relatively simple tool in the hands of any data professional.
Resources for more information
More detailed information about random forests can be found in the resources below; a minimal usage sketch of both estimators follows the list.
- scikit-learn documentation:
  - Random forest classifier (`RandomForestClassifier`): documentation for the model used for classification tasks
  - Random forest regressor (`RandomForestRegressor`): documentation for the model used for regression tasks
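A minimal usage sketch for these two estimators, on small synthetic datasets (illustrative only, not tuned):

```python
# Basic usage of RandomForestClassifier and RandomForestRegressor
# on synthetic data, with a simple train/test split.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
clf.fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
reg = RandomForestRegressor(n_estimators=100, random_state=1)
reg.fit(X_train, y_train)
print("regression R^2:", reg.score(X_test, y_test))
```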