Choosing the metric for the best model

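While training a model and saving checkpoints, I ran into a question: what should count as a "good" model? Until now I had thought mostly in terms of loss, but the data I am working with is heavily imbalanced, so I suspected that loss alone is not the right criterion. I went looking for how other people handle this. The best reference is Kaggle: there is plenty of code, and it is all performance-driven. One notebook I looked at was Pytorch multi labels by cumulated level 🔗. I pulled out only the part that saves the best model.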


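# Runs once per epoch, inside the training loop.
if auroc > best_metric:
    # New best validation AUROC: remember it and save the weights.
    best_metric = auroc
    torch.save(model.state_dict(), f'dict_model_{j}_fold_{fold}_ckpt_pytorch')
else:
    # No improvement: count it, and stop once the count exceeds EARLY_STOPPING.
    early_stoping += 1
    if early_stoping > EARLY_STOPPING:
        print(f'{Fore.RED}{Style.BRIGHT}====> early stopping{Style.RESET_ALL}\n')
        break
# Pad single-digit epoch numbers so the log columns line up.
a = ' ' if epoch + 1 < 10 else ''
print(f'Epoch: {epoch+1}{a}/{EPOCHS} | Train Loss: {train_loss:.6f} | Val loss: {val_loss:.6f} | Val auc {auroc:.6f} | Best auc {best_metric:.6f}  | lr: {lr} ')
# Clear the metric's accumulated state before the next epoch.
metric.reset()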
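
To convince myself how this save-and-stop pattern behaves, I wrote a minimal sketch of it in plain Python. The class name `BestMetricTracker` and its methods are my own, not from the notebook, and the AUROC values are made up for illustration. One deliberate difference: this sketch resets the counter whenever the metric improves, while the quoted snippet does not show such a reset.

```python
class BestMetricTracker:
    """Tracks the best validation metric and decides when to stop (hypothetical helper)."""

    def __init__(self, patience, greater_is_better=True):
        self.patience = patience
        self.greater_is_better = greater_is_better
        self.best = None
        self.bad_epochs = 0

    def update(self, value):
        """Return (is_best, should_stop) for the latest validation value."""
        if self.best is None:
            improved = True
        elif self.greater_is_better:
            improved = value > self.best
        else:
            improved = value < self.best
        if improved:
            self.best = value
            self.bad_epochs = 0  # assumption: reset on improvement
            return True, False
        self.bad_epochs += 1
        # Same comparison as the snippet: stop once the count exceeds the limit.
        return False, self.bad_epochs > self.patience


tracker = BestMetricTracker(patience=2)
history = []
# Made-up validation AUROC values, one per epoch.
for epoch, auroc in enumerate([0.70, 0.74, 0.73, 0.72, 0.71, 0.80]):
    is_best, should_stop = tracker.update(auroc)
    history.append((epoch, is_best, should_stop))
    if is_best:
        pass  # this is where torch.save(model.state_dict(), ...) would go
    if should_stop:
        break
```

With these values the loop stops at the fifth epoch and never sees the final 0.80, which shows the trade-off a small patience makes.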

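The Kaggle code saves the best model based on AUROC, and it stops training once the count of non-improving epochs exceeds the configured limit. So it seems the best-model metric should be chosen to fit the characteristics of the data. That still left me a little uneasy, so I looked for more thoroughly vetted code and dug into Hugging Face's transformers, which is well implemented. With Hugging Face, the Trainer can train with early stopping. First, pass an EarlyStoppingCallback through the Trainer's callbacks parameter.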


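from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # Stop after 2 consecutive evaluations without improvement.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)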

Set the metric you want in metric_for_best_model; it defaults to loss. If you change this parameter, set greater_is_better along with it: True when higher is better, as with AUC, and False when lower is better, as with loss.

metric_for_best_model (str, optional) — Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_". Will default to "loss" if unspecified and load_best_model_at_end=True (to use the evaluation loss).
If you set this value, greater_is_better will default to True. Don’t forget to set it to False if your metric is better when lower.
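
The docs say the metric name must be one that the evaluation returns, so here is a rough sketch of a function that produces such a name. Everything in it is an assumption for illustration: `pairwise_auroc` is my own pure-Python helper, and `compute_metrics` here expects positive-class scores and binary labels, whereas in a real Trainer run the predictions are usually raw logits that need converting first.

```python
def pairwise_auroc(scores, labels):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def compute_metrics(eval_pred):
    # Assumption: eval_pred unpacks to (positive-class scores, binary labels).
    scores, labels = eval_pred
    # The dictionary key is the name that metric_for_best_model refers to.
    return {"auroc": pairwise_auroc(list(scores), list(labels))}


metrics = compute_metrics(([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(metrics)  # {'auroc': 0.75}
```

With a function like this passed to the Trainer as compute_metrics, the key would be referenced as metric_for_best_model="auroc" or "eval_auroc", matching the "with or without the prefix" wording in the quote.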

The other parameters to set are as follows.

load_best_model_at_end = True (EarlyStoppingCallback() requires this to be True).

evaluation_strategy='steps'  # or 'epoch'

eval_steps = 50 (evaluate the metrics after N steps).
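
Put together, the parameters above might look like the following configuration. This is a sketch under assumptions, not a tested recipe: the output path is hypothetical, the "auroc" metric name assumes a compute_metrics function that returns that key, and the save settings are my addition, since load_best_model_at_end expects checkpoints to be saved on the same schedule as evaluation.

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",        # hypothetical path
    evaluation_strategy="steps",     # recent releases name this eval_strategy
    eval_steps=50,                   # evaluate every 50 steps
    save_strategy="steps",           # assumption: keep in step with evaluation
    save_steps=50,
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="auroc",   # assumes compute_metrics returns "auroc"
    greater_is_better=True,          # higher AUROC is better
)

# Passed to Trainer(..., args=args, callbacks=[early_stopping]).
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```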

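In the end, the right approach is to base early stopping on a metric that fits the data, such as F1 or AUC, rather than on loss alone. One caveat: when early stopping is evaluated per step rather than per epoch, test performance can end up worse.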

References

Deciding on a metric to save the checkpoint for best val 🔗

허깅페이스(Huggingface) transformers로 early stopping 사용하기 (Using early stopping with Hugging Face transformers) 🔗