
Random Forest Model Hyperparameter Tuning Tips + Cheat Sheet


 

- This post is a summary/review written while taking the 'Google Advanced Data Analytics Professional Certificate' course.

 

A collection of essential imports that are good to have on hand when working with random forests (scikit-learn)

 

Models

Classification model:

from sklearn.ensemble import RandomForestClassifier

 

Regression model:

from sklearn.ensemble import RandomForestRegressor
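
As a quick smoke test of these imports, here is a minimal sketch that fits a classifier on synthetic data; the make_classification dataset and every parameter value below are illustrative assumptions, not part of the course materials:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Defaults written out explicitly for the hyperparameters covered below
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # mean accuracy on the held-out split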

 

Evaluation metrics 

Classification model:

from sklearn.metrics import (
    accuracy_score, average_precision_score, confusion_matrix, f1_score,
    fbeta_score, log_loss, multilabel_confusion_matrix,
    precision_recall_curve, precision_score, recall_score, roc_auc_score,
)

accuracy_score(y_true, y_pred, *[, ...]) Accuracy classification score.
average_precision_score(y_true, ...) Compute average precision (AP) from prediction scores.
confusion_matrix(y_true, y_pred, *) Compute confusion matrix to evaluate the accuracy of a classification.
f1_score(y_true, y_pred, *[, ...]) Compute the F1 score, also known as balanced F-score or F-measure.
fbeta_score(y_true, y_pred, *, beta) Compute the F-beta score.
log_loss(y_true, y_pred, *[, eps, ...]) Log loss, aka logistic loss or cross-entropy loss.
multilabel_confusion_matrix(y_true, ...) Compute a confusion matrix for each class or sample.
precision_recall_curve(y_true, ...) Compute precision-recall pairs for different probability thresholds.
precision_score(y_true, y_pred, *[, ...]) Compute the precision.
recall_score(y_true, y_pred, *[, ...]) Compute the recall.
roc_auc_score(y_true, y_score, *[, ...]) Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
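
As a sketch of how a few of these metrics are called, continuing from the rf classifier and train/test split fitted above (the variable names are my own, not from the course):

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

y_pred = rf.predict(X_test)               # hard class labels
y_score = rf.predict_proba(X_test)[:, 1]  # positive-class probabilities (binary case)

print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(roc_auc_score(y_test, y_score))     # ROC AUC expects scores, not labels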

 

 

Regression model:

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, mean_squared_log_error,
    median_absolute_error, mean_absolute_percentage_error, r2_score,
)

mean_absolute_error(y_true, y_pred, *) Mean absolute error regression loss.
mean_squared_error(y_true, y_pred, *) Mean squared error regression loss.
mean_squared_log_error(y_true, y_pred, *) Mean squared logarithmic error regression loss.
median_absolute_error(y_true, y_pred, *) Median absolute error regression loss.
mean_absolute_percentage_error(...) Mean absolute percentage error (MAPE) regression loss.
r2_score(y_true, y_pred, *[, ...]) R² (coefficient of determination) regression score function.
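
And the regression counterpart, as a minimal sketch on synthetic data (make_regression and every value here are illustrative assumptions, not course code):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
X_r, y_r = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_r, y_r, random_state=42)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(Xr_train, yr_train)
yr_pred = rf_reg.predict(Xr_test)

print(mean_absolute_error(yr_test, yr_pred))
print(mean_squared_error(yr_test, yr_pred))
print(r2_score(yr_test, yr_pred))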

 

Hyperparameters

The following are the key hyperparameters of scikit-learn's random forest classification model.

 

 

n_estimators
- What it does: Specifies the number of trees your model will build in its ensemble.
- Input type: int
- Default: 100
- Considerations: A typical range is 50–500. Consider how much data you have, how deep the trees are allowed to grow, and how many samples are bootstrapped to grow each tree (you generally need more trees if they're shallow, and more trees if your bootstrap sample size is smaller). Also consider whether your use case has latency requirements.

max_depth
- What it does: Specifies how many levels your tree can have. If None, trees grow until leaves are pure or until all leaves contain fewer than min_samples_split samples.
- Input type: int
- Default: None
- Considerations: Random forest models often use fully grown base learners, but restricting tree depth can reduce training time and prediction latency and prevent overfitting. If not None, consider values 3–8.

min_samples_split
- What it does: Controls the threshold below which nodes become leaves, i.e., the minimum number of samples a node must contain to be split. If float, it represents a fraction (0–1] of max_samples.
- Input type: int or float
- Default: 2
- Considerations: Consider (a) how many samples are in your dataset and (b) how much of that data each base learner is allowed to use (i.e., the value of the max_samples hyperparameter). The fewer samples available, the fewer samples may need to be required for a split (otherwise the tree would be very shallow).

min_samples_leaf
- What it does: A split can only occur if it guarantees a minimum of this number of observations in each resulting node. If float, it represents a fraction (0–1] of max_samples.
- Input type: int or float
- Default: 1
- Considerations: Consider (a) how many samples are in your dataset and (b) how much of that data each base learner is allowed to use (i.e., the value of the max_samples hyperparameter). The fewer samples available, the fewer samples may need to be required in each leaf node (otherwise the tree would be very shallow).

max_features
- What it does: Specifies the number of features that each tree randomly selects during training. If int, consider max_features features at each split. If float, max_features is a fraction and round(max_features * n_features) features are considered at each split. If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If None, then max_features=n_features.
- Input type: {"sqrt", "log2", None}, int or float
- Default: "sqrt"
- Considerations: Consider how many features the dataset has and how many trees will be grown. Sampling fewer features for each split means more base learners may be needed. Small max_features values on datasets with many features mean more unpredictive trees in the ensemble.

max_samples*
- What it does: Specifies the number of samples bootstrapped from the dataset to train each base model. If float, it represents a fraction (0–1] of the dataset. If None, draw X.shape[0] samples.
- Input type: int or float
- Default: None
- Considerations: Consider the size of your dataset. When working with large datasets, it can be beneficial to limit the number of samples in each tree, because doing so can greatly reduce training time and yet still result in a robust model. For example, 20% of 1 billion samples may be enough to capture the patterns in the data, but if you have 1,000 samples you'll probably need to use them all.

* max_samples only takes effect when bootstrap=True (the scikit-learn default).
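
To put the cheat sheet to work, one common approach is a cross-validated grid search over these hyperparameters. Below is a minimal sketch, assuming the X_train/y_train split from the earlier classification example; the grid values are illustrative guesses drawn from the ranges above, not recommended settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid following the ranges discussed above
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [3, 8, None],
    'min_samples_leaf': [1, 4],
    'max_features': ['sqrt', 'log2'],
    'max_samples': [0.5, None],  # only used because bootstrap=True by default
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='f1',  # choose any metric from the cheat sheet above
    cv=5,
    n_jobs=-1,     # parallelize across CPU cores
)
grid.fit(X_train, y_train)
print(grid.best_params_)  # best hyperparameter combination found
print(grid.best_score_)   # its mean cross-validated F1 score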

 

When building machine learning models, it is essential to have the right tools and to understand how to use them.

 

There are many other hyperparameters, but the ones covered above are the core.

 

Tweaking hyperparameters to improve a model is surprisingly fun.

 

