📦 DL
Towards a Human-like Open-Domain Chatbot
date
May 25, 2023
slug
nlp-3
author
status
Public
tags
NLP
Chatbot
summary
This paper contributes (1) a simple human evaluation metric (SSA), (2) evidence of a strong correlation between perplexity and SSA, and (3) an end-to-end neural model whose low perplexity yields the best chatbot performance to date.
type
Post
category
📦 DL
updatedAt
May 25, 2023 08:56 AM
Title: Towards a Human-like Open-Domain Chatbot
Keyword: Chatbot, end-to-end model, perplexity-SSA, interactive model, sample-and-rank, random sampling with temperature, cherry-picked, manual coordinate-descent search
Concept
An open-domain chatbot can hold a conversation on any topic; prior approaches include knowledge-based, retrieval-based, and rule-based systems.
Meena
- trained end-to-end on 40B words
- seq2seq model + Evolved Transformer
- trained on multi-turn conversations: context → response pairs
SSA: a human-evaluation metric collected in both static and interactive settings.
This paper contributes (1) a simple human evaluation metric (SSA), (2) evidence of a strong correlation between perplexity and SSA, and (3) an end-to-end neural model whose low perplexity yields the best chatbot performance to date.
Progress
Evaluating chatbots
[1] basic human judgement (sensibleness: is the response consistent and sensible?), static evaluation
MTB (Mini-Turing Benchmark): a fixed collection of contexts with 1–3 turns each
[2] subjective human judgement (specificity), interactive evaluation
requires 14–28 turns per conversation, always starting with ‘Hi!’
can also probe quality, fluency, diversity, relatedness, and empathy
Combining the [1] and [2] labels gives the metric called SSA (Sensibleness and Specificity Average).
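Concretely, SSA is just the arithmetic mean of the two label rates; a tiny illustration (the numbers are made up):

```python
# SSA = average of sensibleness and specificity rates (illustrative values).
sensibleness = 0.87   # fraction of responses labeled "makes sense"
specificity  = 0.72   # fraction of responses labeled "specific to context"
ssa = (sensibleness + specificity) / 2
print(ssa)  # 0.795
```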
Meena chatbot: developed by Google
[1] Training Data (social media conversation)
Source data are structured as “message trees” [341 GB of text, 40B words]
Filtering: keep messages with 2–128 subwords, ≥70% alphabetic characters, no URLs, author username not containing ‘bot’, and not repeated more than 100 times (a sketch of these filters appears after the tokenization note below)
Tokenization: BPE (byte-pair encoding) with an 8K subword vocabulary
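A minimal Python sketch of these filters; the function name and inputs are my own, and `num_subwords` / `global_count` are assumed to be computed upstream:

```python
import re

def keep_message(text: str, author: str, num_subwords: int, global_count: int) -> bool:
    """Apply the paper's message filters (thresholds from the paper)."""
    if not (2 <= num_subwords <= 128):       # too short or too long
        return False
    alphabetic = sum(c.isalpha() for c in text)
    if alphabetic < 0.7 * len(text):         # under 70% alphabetic characters
        return False
    if re.search(r"https?://", text):        # contains a URL
        return False
    if "bot" in author.lower():              # author username contains 'bot'
        return False
    if global_count > 100:                   # exact copy repeated > 100 times
        return False
    return True
```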
[2] Model Architecture
1 ET (Evolved Transformer) encoder block + 13 ET decoder blocks, hidden size 2560, 32 attention heads
The hyperparameters were found via manual coordinate-descent search.
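For reference, the reported shape can be captured in a small config sketch (the dataclass and its field names are mine, not from any released codebase):

```python
from dataclasses import dataclass

@dataclass
class MeenaConfig:
    encoder_blocks: int = 1      # Evolved Transformer encoder blocks
    decoder_blocks: int = 13     # Evolved Transformer decoder blocks
    hidden_size: int = 2560
    attention_heads: int = 32
    vocab_size: int = 8192       # ~8K BPE subwords
```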
[3] Decoding
Typical problem: unspecific, bland responses → often attacked with more sophisticated decoding algorithms; Meena instead uses simple sample-and-rank.
Sample N independent responses by plain random sampling with temperature T (T rescales the logits: lower T widens the gap between the most and least likely tokens, higher T narrows it); chosen N = 20, T = 0.88 (sketched below).
If T > 1, too much probability mass goes to contextually inappropriate tokens;
if T < 1, responses are safer but lean on common words such as prepositions.
** Beam search yields overly common, boring answers (similar effect to a low T).
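A runnable sketch of sample-and-rank with temperature; `logits_fn` is a hypothetical stand-in for the decoder's next-token logits, not the paper's interface:

```python
import numpy as np

def sample_and_rank(logits_fn, context, N=20, T=0.88, max_len=32, eos=1, seed=0):
    """Draw N independent responses with temperature sampling, then return
    the candidate with the highest (untempered) log-probability."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(N):
        response, score = [], 0.0
        for _ in range(max_len):
            logits = np.asarray(logits_fn(context, response), dtype=float)
            # Temperature sampling: divide logits by T before the softmax.
            p = np.exp(logits / T - np.max(logits / T))
            p /= p.sum()
            tok = int(rng.choice(len(p), p=p))
            # Rank candidates by the model's own (T = 1) log-likelihood.
            logp = logits - np.max(logits)
            logp -= np.log(np.exp(logp).sum())
            score += logp[tok]
            response.append(tok)
            if tok == eos:
                break
        if score > best_score:
            best, best_score = response, score
    return best
```

With T < 1 the softmax sharpens (safer, more generic tokens); with T > 1 it flattens, which is what pushes probability onto inappropriate tokens.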
[4] Other strategy: when cherry-picked conversations are shown, they are selected only after the conversations have been completed.
Result
- Focus on sensibleness and specificity for crowd workers’ evaluation (SSA)
- Best SSA of any chatbot to date (as of 2019)
- BUT results may be biased by the first turn’s context (every interactive conversation starts with ‘Hi!’).
- It would be better to also evaluate directions such as humor, empathy, and knowledge skills.
Opinion
- Would the correlation between PPL and SSA still hold for an advanced chatbot with long-term memory?
- Is ‘random sampling with temperature’ also the best decoding choice for logical conversations (factual knowledge, math, etc.)?
- Cons) Such conversations need one specific answer; whether it is common or uncommon does not matter. → Beam search or greedy decoding could be more useful.
- Do we then need to classify the conversation’s topic first? → more hyperparameters
- How can multiple sampling methods be used in a single model?
- Switched at each time step, or across the whole sequence?
- For end-to-end models, PPL is the best automatic evaluation metric!
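For reference, perplexity is the exponentiated average negative log-likelihood of held-out tokens; a minimal sketch:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp(-mean log p(token)); lower = model is less 'surprised'."""
    return float(np.exp(-np.mean(token_log_probs)))

# A model assigning each token probability 0.1 has PPL = 10;
# the paper reports a test PPL of about 10.2 for the best Meena model.
print(perplexity(np.log([0.1, 0.1, 0.1])))  # ≈ 10.0
```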
Advanced
Perplexity: https://supkoon.tistory.com/41
Random sampling with temperature
Solving cross-turn repetition: filter out candidate responses that share common sub-sequences with earlier turns (a small sketch appears at the end of this section).
Other Sampling Method: https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc
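A small sketch of the cross-turn repetition filter mentioned above; the substring length threshold is my guess, not the paper's exact rule:

```python
def has_cross_turn_repetition(candidate: str, prior_turns, min_len: int = 10) -> bool:
    """Flag a candidate that shares a long common substring with any earlier turn."""
    for turn in prior_turns:
        for i in range(len(candidate) - min_len + 1):
            if candidate[i:i + min_len] in turn:
                return True
    return False

# Candidates flagged here would be removed before sample-and-rank picks a winner.
```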