DL
BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
Title: Performance gains from deleting noisy examples in NLP fine-tuning
Keywords: Dynamic Data Pruning, Sub-dataset, GraNd, EL2N
Concept
GraNd & EL2N: two pruning metrics originally proposed for image classification models
→ This paper tries to apply them to NLP tasks
- Identify the important examples in a dataset
- How do the results differ from computer vision?
- Removing noisy examples can sometimes yield better performance than fine-tuning on the full dataset
GraNd
: Expected value of the per-example loss gradient norm
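For reference, the GraNd score as defined in the original paper (Paul et al., 2021); a sketch in standard notation, where ℓ is the loss and f_{θ_t} the model at step t:

```latex
% GraNd score of example (x, y) at training step t:
% expected L2 norm of the per-example loss gradient w.r.t. the weights \theta_t.
\chi_t(x, y) = \mathbb{E}_{\theta_t}
  \left\| \nabla_{\theta_t}\, \ell\big(f_{\theta_t}(x),\, y\big) \right\|_2
```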
EL2N
: Estimate of GraNd (p: output probability vector)
→ Obtained by averaging the model output "p" across the training epochs.
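A sketch of the EL2N definition from the same paper; here p(x) is the softmax output probability vector and y the one-hot label:

```latex
% EL2N score: expected L2 norm of the error vector; a cheap estimate of GraNd.
\hat{\chi}_t(x, y) = \mathbb{E} \left\| p(x) - y \right\|_2,
\qquad p(x) = \mathrm{softmax}\big(f_{\theta_t}(x)\big)
```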
Progress
Setting: pre-trained BERT (NLP domain); the metrics are computed over fine-tuning steps (not epochs); 5 independent runs (a scoring sketch follows the list below)
Dataset: MNLI (natural language inference), AG News (topic classification)
- Score computation step: at early fine-tuning steps, the scores are less reliable than random sampling. Only on AG News do the metrics give reasonable scores after ~500 steps; on MNLI they beat random selection only after >1500 steps.
- Preserved fraction: investigates how the size of the retained training set affects performance. At smaller subset fractions, the metrics do not perform well. On MNLI, EL2N & GraNd's performance overtakes random selection sooner than on AG News.
- Noise case: the highest-scoring examples can be noisy and distort the result → so remove the top-scoring sub-dataset. Interestingly, on MNLI, removing the top >3% of examples yields better performance than using the whole dataset.
- With a limited dataset, EL2N can sometimes still find the important examples by fine-tuning with only a few seeds.
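To make the procedure concrete, here is a minimal sketch of EL2N scoring and pruning for a BERT classifier, assuming PyTorch and a Hugging Face sequence-classification model; `el2n_scores`, `prune_indices`, the batch field names, and the fractions are illustrative assumptions, not the paper's released code:

```python
# Minimal sketch: EL2N scoring during BERT fine-tuning, then pruning.
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, dataloader, num_labels, device="cuda"):
    """EL2N per example: ||softmax(logits) - one_hot(label)||_2."""
    model.eval()
    scores = []
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits
        p = F.softmax(logits, dim=-1)
        y = F.one_hot(batch["labels"], num_classes=num_labels).float()
        scores.append((p - y).norm(dim=-1))   # per-example L2 error norm
    return torch.cat(scores)

# Average over several independent fine-tuning runs (the paper uses 5),
# after enough steps for the scores to stabilize (>1500 on MNLI per the note):
# scores = torch.stack([el2n_scores(m, loader, 3) for m in models]).mean(0)

def prune_indices(scores, keep_frac=0.5, drop_top_frac=0.03):
    """Drop the highest-scoring (likely noisy) examples first,
    then keep the top `keep_frac` of the remaining examples."""
    order = scores.argsort(descending=True)   # hardest/noisiest first
    n_drop = int(len(order) * drop_top_frac)
    kept = order[n_drop:]                     # remove suspected label noise
    n_keep = int(len(scores) * keep_frac)
    return kept[:n_keep]                      # most important survivors
```

With scores averaged over the runs, `prune_indices` first drops the suspected-noisy top fraction (the >3% removal that helped on MNLI) and then keeps the highest-scoring fraction of what remains as the training subset.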
Result
The authors got different results in NLP → the metrics did not work well (for score computation and preserved fraction), BUT in the noise-removal case, a subset can achieve better performance than the full dataset.
→ Future goal: find a pruning mechanism that works within a single fine-tuning procedure.
Opinion
- How can the dataset be classified as noisy? → Are there any validation tools?
- What is the academic reason for the different results between vision and NLP, and between language inference and topic classification? (fewer labels in NLP)
- Need more experience with NLP tasks to choose a proper pre-trained model at the business level (a structured service is required).
Advanced
- Noisy data: https://norman3.github.io/papers/docs/noisy_large_scale_dataset.html (a possible answer?)
- Find the reason for the constant value in the no-pruning cases.
- Look into more types of metric methods for NLP (popular ones).