DL
BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
Title: Performance gains from deleting noisy examples in NLP fine-tuning
Keywords: Dynamic Data Pruning, Sub-dataset, GraNd, EL2N
Concept
GraNd & EL2N: two pruning metrics originally proposed for image classification models
→ This paper tries to apply them to NLP tasks
- Identify the important examples in a dataset
- How do the results differ from computer vision?
- Removing noisy examples can sometimes yield better performance than fine-tuning on the full dataset
GraNd
: Expected value of the per-example loss gradient norm
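For reference, the GraNd score as defined in the original paper (Paul et al., 2021); a sketch in standard notation, where ℓ is the loss and f_{θ_t} the model at step t:

```latex
% GraNd score of example (x, y) at training step t:
% expected L2 norm of the per-example loss gradient w.r.t. the weights \theta_t.
\chi_t(x, y) = \mathbb{E}_{\theta_t}
  \left\| \nabla_{\theta_t}\, \ell\big(f_{\theta_t}(x),\, y\big) \right\|_2
```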
EL2N
: Estimate of GraNd (p: output probability vector)
→ Obtained by averaging the model output "p" across the training epochs.
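A sketch of the EL2N definition from the same paper; here p(x) is the softmax output probability vector and y the one-hot label:

```latex
% EL2N score: expected L2 norm of the error vector; a cheap estimate of GraNd.
\hat{\chi}_t(x, y) = \mathbb{E} \left\| p(x) - y \right\|_2,
\qquad p(x) = \mathrm{softmax}\big(f_{\theta_t}(x)\big)
```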
Progress
Setting: pre-trained BERT (NLP domain); the metrics are computed over fine-tuning steps (not epochs); 5 independent runs (a scoring sketch follows the list below)
Dataset: MNLI (natural language inference), AG News (topic classification)
- Score computation step: at early fine-tuning steps, the scores are less reliable than random sampling. Only on AG News do the metrics give reasonable scores after ~500 steps; on MNLI they beat random selection only after >1500 steps.
- Preserved fraction: investigates how the size of the retained training set affects performance. At smaller subset fractions, the metrics do not perform well. On MNLI, EL2N & GraNd's performance overtakes random selection sooner than on AG News.
- Noise case: the highest-scoring examples can be noisy and distort the result → so remove the top-scoring sub-dataset. Interestingly, on MNLI, removing the top >3% of examples yields better performance than using the whole dataset.
- With a limited dataset, EL2N can sometimes still find the important examples by fine-tuning with only a few seeds.
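To make the procedure concrete, here is a minimal sketch of EL2N scoring and pruning for a BERT classifier, assuming PyTorch and a Hugging Face sequence-classification model; `el2n_scores`, `prune_indices`, the batch field names, and the fractions are illustrative assumptions, not the paper's released code:

```python
# Minimal sketch: EL2N scoring during BERT fine-tuning, then pruning.
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, dataloader, num_labels, device="cuda"):
    """EL2N per example: ||softmax(logits) - one_hot(label)||_2."""
    model.eval()
    scores = []
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits
        p = F.softmax(logits, dim=-1)
        y = F.one_hot(batch["labels"], num_classes=num_labels).float()
        scores.append((p - y).norm(dim=-1))   # per-example L2 error norm
    return torch.cat(scores)

# Average over several independent fine-tuning runs (the paper uses 5),
# after enough steps for the scores to stabilize (>1500 on MNLI per the note):
# scores = torch.stack([el2n_scores(m, loader, 3) for m in models]).mean(0)

def prune_indices(scores, keep_frac=0.5, drop_top_frac=0.03):
    """Drop the highest-scoring (likely noisy) examples first,
    then keep the top `keep_frac` of the remaining examples."""
    order = scores.argsort(descending=True)   # hardest/noisiest first
    n_drop = int(len(order) * drop_top_frac)
    kept = order[n_drop:]                     # remove suspected label noise
    n_keep = int(len(scores) * keep_frac)
    return kept[:n_keep]                      # most important survivors
```

With scores averaged over the runs, `prune_indices` first drops the suspected-noisy top fraction (the >3% removal that helped on MNLI) and then keeps the highest-scoring fraction of what remains as the training subset.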
Result
The authors got different results in NLP → the metrics did not work well (for score computation and preserved fraction), BUT in the noise-removal case, a subset can achieve better performance than the full dataset.
→ Future goal: find a pruning mechanism that works within a single fine-tuning procedure.
Opinion
- How can the dataset be classified as noisy? → Are there any validation tools?
- What is the academic reason for the different results between vision and NLP, and between language inference and topic classification? (fewer labels in NLP)
- Need more experience with NLP tasks to choose a proper pre-trained model at the business level (a structured service is required).
Advanced
- Noisy data: https://norman3.github.io/papers/docs/noisy_large_scale_dataset.html (a possible answer?)
- Find the reason for the constant value in the no-pruning cases.
- Look into more types of metric methods for NLP (popular ones).