- This post summarizes my notes from the 'Google Advanced Data Analytics Professional Certificate' course.
Models
Classification model:
from xgboost import XGBClassifier
Regression model:
from xgboost import XGBRegressor
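Both classes follow the scikit-learn estimator API (fit/predict). Below is a minimal fitting sketch; the synthetic datasets from sklearn.datasets are illustrative assumptions, not part of the course material.

```python
# Minimal sketch: fit an XGBoost classifier and regressor on synthetic data.
# The make_classification/make_regression datasets are illustrative only.
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(objective="binary:logistic", random_state=0)
clf.fit(X_train, y_train)
print("classifier accuracy:", clf.score(X_test, y_test))

Xr, yr = make_regression(n_samples=1000, n_features=20, random_state=0)
reg = XGBRegressor(objective="reg:squarederror", random_state=0)
reg.fit(Xr, yr)
print("first prediction:", reg.predict(Xr[:1]))
```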
Evaluation metrics
Classification models:

from sklearn.metrics import ...

| Function | Description |
|---|---|
| accuracy_score(y_true, y_pred, *[, ...]) | Accuracy classification score |
| average_precision_score(y_true, ...) | Compute average precision (AP) from prediction scores |
| confusion_matrix(y_true, y_pred, *) | Compute confusion matrix to evaluate the accuracy of a classification |
| f1_score(y_true, y_pred, *[, ...]) | Compute the F1 score, also known as balanced F-score or F-measure |
| fbeta_score(y_true, y_pred, *, beta) | Compute the F-beta score |
| log_loss(y_true, y_pred, *[, eps, ...]) | Log loss, aka logistic loss or cross-entropy loss |
| multilabel_confusion_matrix(y_true, ...) | Compute a confusion matrix for each class or sample |
| precision_recall_curve(y_true, ...) | Compute precision-recall pairs for different probability thresholds |
| precision_score(y_true, y_pred, *[, ...]) | Compute the precision |
| recall_score(y_true, y_pred, *[, ...]) | Compute the recall |
| roc_auc_score(y_true, y_score, *[, ...]) | Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores |
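As a usage sketch, here is how a few of these metrics could be applied to the classifier fitted earlier (clf, X_test, y_test are carried over from that sketch):

```python
# Sketch: scoring the fitted classifier with several of the metrics listed.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # scores for the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("roc auc  :", roc_auc_score(y_test, y_proba))  # needs scores, not labels
print(confusion_matrix(y_test, y_pred))
```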
Regression models:

from sklearn.metrics import ...

| Function | Description |
|---|---|
| mean_absolute_error(y_true, y_pred, *) | Mean absolute error regression loss |
| mean_squared_error(y_true, y_pred, *) | Mean squared error regression loss |
| mean_squared_log_error(y_true, y_pred, *) | Mean squared logarithmic error regression loss |
| median_absolute_error(y_true, y_pred, *) | Median absolute error regression loss |
| mean_absolute_percentage_error(...) | Mean absolute percentage error (MAPE) regression loss |
| r2_score(y_true, y_pred, *[, ...]) | R² (coefficient of determination) regression score function |
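A matching sketch for the regression metrics, reusing reg, Xr, yr from the earlier fitting example (scoring on training data here only to keep the sketch short):

```python
# Sketch: scoring the fitted XGBRegressor with the regression metrics above.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

yr_pred = reg.predict(Xr)
print("MAE:", mean_absolute_error(yr, yr_pred))
print("MSE:", mean_squared_error(yr, yr_pred))
print("R^2:", r2_score(yr, yr_pred))
```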
Hyperparameters
The following are the most important hyperparameters for a gradient boosting classification model built with the XGBoost library.
Because they are the most intuitive, data professionals typically consider tuning these hyperparameters first.
n_estimators
| Hyperparameter | Description | Type | Default |
|---|---|---|---|
| n_estimators | Specifies the number of boosting rounds (i.e., the number of trees the model will build for its ensemble) | int | 100 |
Considerations:
A typical range is 50–500. Consider how much data you have, how deep the trees are allowed to grow, and how many samples are bootstrapped from the overall data to grow each tree (you generally need more trees if they're shallow, and more trees if your bootstrap sample size is only a small fraction of your overall data).
For an extreme but illustrative example, if you have a dataset of 10,000 samples and each tree bootstraps only 20 of them, you'll need far more trees than if each tree got 5,000 samples.
Also keep in mind that, unlike random forest, which can grow base learners in parallel, gradient boosting grows base learners successively, so training time increases with the number of trees.
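One common way to avoid hand-picking n_estimators is early stopping on a validation set: set the round count high and stop once the eval metric stops improving. A sketch, with the caveat that where early_stopping_rounds goes (constructor vs. fit()) depends on your XGBoost version:

```python
# Sketch: let early stopping pick the useful number of boosting rounds.
# In recent XGBoost versions early_stopping_rounds is a constructor argument;
# older versions took it in fit() instead, so check your version's docs.
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=500, early_stopping_rounds=20,
                      eval_metric="logloss", random_state=0)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print("best iteration:", model.best_iteration)
```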
max_depth
| Hyperparameter | Description | Type | Default |
|---|---|---|---|
| max_depth | Specifies how many levels your base learner trees can have. If None, trees grow until leaves are pure or until all leaves have less than min_child_weight. | int | 3 |
Considerations: Controls complexity of the model. Gradient boosting typically uses weak learners, or “decision stumps” (i.e., shallow trees). Restricting tree depth can reduce training times and serving latency as well as prevent overfitting. Consider values 2–6.
min_child_weight
| Hyperparameter | Description | Type | Default |
|---|---|---|---|
| min_child_weight | Controls the threshold below which a node becomes a leaf, based on the combined weight of the samples it contains. For regression models, this value is functionally equivalent to a number of samples. For the binary classification objective, the weight of a sample in a node depends on its probability of response as calculated by that tree; the weight decreases the more certain the model is (i.e., the closer the probability of response is to 0 or 1). | int or float | 1 |
Considerations: Higher values will stop trees from splitting further, and lower values will allow trees to keep splitting.
If the model is underfitting, consider lowering this value so the trees can grow more complex.
Conversely, raise the value to keep trees from splitting into overly fine partitions.
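The effect on tree growth can be inspected directly: higher min_child_weight values make nodes become leaves sooner, so the trees end up smaller. A sketch using the booster's trees_to_dataframe() to count grown nodes (the specific values 1/10/100 are illustrative assumptions):

```python
# Sketch: higher min_child_weight -> nodes become leaves sooner -> smaller
# trees. trees_to_dataframe() lists every node across all boosted trees.
from xgboost import XGBClassifier

for mcw in [1, 10, 100]:
    m = XGBClassifier(n_estimators=50, min_child_weight=mcw, random_state=0)
    m.fit(X_train, y_train)
    n_nodes = len(m.get_booster().trees_to_dataframe())
    print(f"min_child_weight={mcw:<4} total nodes={n_nodes}")
```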
learning_rate
| Hyperparameter | Description | Type | Default |
|---|---|---|---|
| learning_rate | Controls how much importance is given to each consecutive base learner in the ensemble's final prediction. Also known as eta or shrinkage. | float | 0.1 |
Considerations: Values fall in the range (0, 1]. Typical values range from 0.01 to 0.3.
Lower values mean less weight is given to each consecutive base learner.
Consider how many trees are in your ensemble. Lower values typically benefit from more trees.
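That trade-off can be seen directly: pairing a smaller learning_rate with a larger n_estimators tends to hold validation accuracy roughly steady. The specific value pairs below are illustrative assumptions:

```python
# Sketch of the learning_rate / n_estimators trade-off: smaller step sizes
# usually need more boosting rounds to reach comparable validation scores.
from xgboost import XGBClassifier

for lr, n in [(0.3, 100), (0.1, 300), (0.01, 1000)]:
    m = XGBClassifier(learning_rate=lr, n_estimators=n, max_depth=3,
                      random_state=0)
    m.fit(X_train, y_train)
    print(f"learning_rate={lr:<5} n_estimators={n:<5} "
          f"val accuracy={m.score(X_test, y_test):.3f}")
```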
colsample_bytree*
| Hyperparameter | Description | Type | Default |
|---|---|---|---|
| colsample_bytree* | Specifies the fraction (0, 1.0] of features that each tree randomly selects during training | float | 1.0 |
Considerations: Adds randomness to the model to make it robust to noise.
Consider how many features the dataset has and how many trees will be grown.
Fewer features sampled means more base learners might be needed.
Small colsample_bytree values on datasets with many features mean more unpredictive trees in the ensemble.
subsample*
| Hyperparameter | Description | Type | Default |
|---|---|---|---|
| subsample* | Specifies the fraction (0, 1.0] of observations sampled from the dataset to train each base model. | float | 1.0 |
Considerations: Adds randomness to the model to make it robust to noise.
Consider the size of your dataset.
When working with large datasets, it can be beneficial to limit the number of samples in each tree, because doing so can greatly reduce training time and yet still result in a robust model.
For example, 20% of 1 billion might be enough to capture patterns in the data, but if you only have 1,000 samples in your dataset then you’ll probably need to use them all.
Remember that using fractions of the data to train each base learner can possibly improve model predictions and certainly speed up training times.
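To tie the six hyperparameters above together (including max_depth, min_child_weight, colsample_bytree, and subsample), here is a tuning sketch using scikit-learn's GridSearchCV; the grid values are illustrative assumptions, not recommendations from the course:

```python
# Sketch: tuning the hyperparameters discussed above with GridSearchCV.
# Widen or narrow the (illustrative) grid based on your data and time budget.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "min_child_weight": [1, 5],
    "learning_rate": [0.05, 0.1],
    "colsample_bytree": [0.7, 1.0],
    "subsample": [0.7, 1.0],
}
search = GridSearchCV(XGBClassifier(random_state=0), param_grid,
                      scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("best CV f1 :", search.best_score_)
```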
Additional information
More detailed information about XGBoost can be found here:
- scikit-learn model metrics: documentation for evaluation metrics in scikit-learn
- XGBoost classifier: XGBoost documentation for classification tasks using the scikit-learn API
- XGBoost Regressor: XGBoost documentation for regression tasks using the scikit-learn API
- Notes on parameter tuning from XGBoost
- XGBoost parameters: XGBoost parameters guide. NOTE: The information in this link is not specific to the scikit-learn API. The default values listed in this resource are not always the same as the ones in the scikit-learn API.