
Towards a Human-like Open-Domain Chatbot

date
May 25, 2023
slug
nlp-3
author
status
Public
tags
NLP
Chatbot
summary
This paper contributes (1) a simple human evaluation metric (SSA), (2) evidence of a correlation between perplexity and SSA, and (3) an end-to-end neural model whose low perplexity yields the best performance.
type
Post
thumbnail
DALL·E 2023-05-25 17.56.36 - Towards a Human-like Open-Domain Chatbot like robot and give 사이버펑크 style .png
category
📦 DL
updatedAt
May 25, 2023 08:56 AM

Title: Towards a Human-like Open-Domain Chatbot

Keyword: Chatbot, end-to-end model, perplexity-SSA, interactive model, sample-and-rank, random sampling with temperature, cherry-picked, manual coordinate-descent search

Concept

An open-domain chatbot can hold a conversation on any topic; existing systems typically rely on knowledge-based, retrieval-based, or rule-based components.
Meena
  • trained end-to-end on 40B words
  • seq2seq model + Evolved Transformer
  • trained on multi-turn conversations, framed as (context → response) pairs
SSA: a human evaluation metric collected in both static and interactive settings.
This paper contributes (1) a simple human evaluation metric (SSA), (2) evidence of a correlation between perplexity and SSA, and (3) an end-to-end neural model whose low perplexity yields the best performance.

Progress

Evaluating chatbots

[1] Basic human judgment (sensibleness: does the response make sense and stay consistent?), collected in a static setting.
MTB (Mini Turing Benchmark): a collection of fixed contexts with 1~3 turns, used for static evaluation.
[2] More subjective human judgment (specificity: is the response specific to the given context?), collected interactively.
Interactive evaluation requires 14~28 turns per conversation, always started with 'Hi!'.
The evaluation also reflects quality, fluency, diversity, relatedness, and empathy.
Combining the [1] and [2] labels over the static or interactive test gives the SSA (Sensibleness and Specificity Average); see the sketch below.
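Below is a minimal sketch of how SSA could be computed from crowd-worker labels, assuming each response receives a binary (sensible, specific) judgment; the label format and the `ssa` helper are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of how SSA could be computed from crowd-worker labels.
# `labels` is a hypothetical list of (sensible, specific) binary judgments,
# one pair per chatbot response; the label format is an assumption.

def ssa(labels):
    """Sensibleness and Specificity Average: mean of the two per-response rates."""
    sensibleness = sum(s for s, _ in labels) / len(labels)
    specificity = sum(p for _, p in labels) / len(labels)
    return (sensibleness + specificity) / 2

# Example: 3 of 4 responses judged sensible, 2 of 4 judged specific -> SSA = 0.625
print(ssa([(1, 1), (1, 0), (1, 1), (0, 0)]))
```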

Meena chatbot: developed by Google

[1] Training Data (social media conversations)
Source data are organized as "message trees"; the filtered corpus totals 341GB of text (40B words).
Filtering: keep messages with 2~128 BPE subwords, at least 70% alphabetic characters, no URLs, an author name not containing 'bot', and text not repeated more than 100 times (a hedged sketch of these rules follows).
Tokenization: BPE (byte-pair encoding)
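A hedged sketch of the filtering rules above; the `keep_message` function and its `num_subwords` tokenizer hook are hypothetical stand-ins, not the paper's actual data pipeline.

```python
import re
from collections import Counter

# Hedged sketch of the message-filtering rules listed above; the function and
# the `num_subwords` tokenizer hook are hypothetical stand-ins.

def keep_message(text, author, num_subwords, repeat_counts: Counter):
    """Return True if a message passes the filters described in the notes."""
    n = num_subwords(text)                       # BPE subword count
    if not (2 <= n <= 128):
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.7:                        # at least 70% alphabetic characters
        return False
    if re.search(r"https?://", text):            # drop messages containing URLs
        return False
    if "bot" in author.lower():                  # drop bot-like author names
        return False
    if repeat_counts[text] > 100:                # drop text repeated more than 100 times
        return False
    return True
```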
[2] Model Architecture
1 ET (Evolved Transformer) encoder block + 13 ET decoder blocks, hidden size 2560, 32 attention heads (about 2.6B parameters).
The hyperparameters were found via manual coordinate-descent search.
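A rough sketch of what a coordinate-descent hyperparameter search looks like: tune one hyperparameter at a time (holding the others fixed) and keep the value with the lowest validation perplexity. The `train_and_eval_ppl` callback and the grids are assumptions; the paper's search was done manually.

```python
# Hedged sketch of coordinate-descent hyperparameter search: sweep one
# hyperparameter at a time while holding the others fixed, keeping the value
# that gives the lowest validation perplexity. `train_and_eval_ppl` is a
# hypothetical callback; the grids are illustrative only.

def coordinate_descent(initial, grids, train_and_eval_ppl, sweeps=2):
    best = dict(initial)
    best_ppl = train_and_eval_ppl(best)
    for _ in range(sweeps):                      # repeat passes over all coordinates
        for name, candidates in grids.items():   # one hyperparameter at a time
            for value in candidates:
                trial = {**best, name: value}
                ppl = train_and_eval_ppl(trial)
                if ppl < best_ppl:
                    best, best_ppl = trial, ppl
    return best, best_ppl
```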
[3] Decoding
Typical problem: responses that are not specific and are bland → addressed with a different decoding strategy (sample-and-rank instead of beam search); a minimal sketch follows this list.
Sample-and-rank: sample N independent responses by plain random sampling with temperature T (T widens or narrows the gap between high- and low-probability tokens), then keep the candidate with the highest probability; the paper chooses N = 20, T = 0.88.
If T > 1, too much probability mass is assigned to contextually wrong tokens;
if T < 1, sampling is safer but leans on common words such as prepositions, so responses are less specific.
** Beam search: gives overly common and boring answers (similar effect to a low T).
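A minimal sketch of sample-and-rank decoding under the settings above (N = 20, T = 0.88). The `next_token_logits` callback and token ids are hypothetical, and ranking by log-probability under the temperature-scaled distribution is a simplification of the paper's ranking by response probability.

```python
import math
import random

# Minimal sketch of sample-and-rank: draw N independent responses with plain
# random sampling at temperature T, then keep the highest-scoring candidate.
# `next_token_logits` is a hypothetical model callback returning a list of
# logits over the vocabulary for a given token-id sequence.

def sample_with_temperature(next_token_logits, context, eos_id, T=0.88, max_len=128):
    tokens, logprob = [], 0.0
    while len(tokens) < max_len:
        logits = next_token_logits(context + tokens)
        scaled = [z / T for z in logits]                  # temperature reshapes the distribution
        m = max(scaled)
        probs = [math.exp(z - m) for z in scaled]
        total = sum(probs)
        probs = [p / total for p in probs]
        tok = random.choices(range(len(probs)), weights=probs)[0]
        logprob += math.log(probs[tok])                   # score under the sampling distribution (simplification)
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens, logprob

def sample_and_rank(next_token_logits, context, eos_id, N=20, T=0.88):
    candidates = [sample_with_temperature(next_token_logits, context, eos_id, T) for _ in range(N)]
    return max(candidates, key=lambda c: c[1])            # rank by total log-probability
```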
[4] Other strategy: when cherry-picked conversations are reported, they are selected only after the conversations have been completed.

Result

  • Crowd-worker evaluation focuses on sensibleness and specificity (SSA).
  • Best performance among chatbots evaluated so far (as of 2019).
  • BUT results can be biased, mostly by the first turn's context.
  • It would be better to add further directions such as humor, empathy, and knowledge skills.
notion image

 

Opinion

  • Would the correlation between PPL and SSA also hold for more advanced chatbots that include long-term memory?
  • Is 'random sampling with temperature' really the best choice at the decoding layer for logical conversations (factual knowledge, math, etc.)?
    • Cons) These conversations need one specific answer; whether the wording is common or uncommon is beside the point. → Beam search or greedy decoding could be more useful.
    • Then do we need to classify the conversation's topic first? → more hyperparameters.
  • How can multiple sampling methods be used in a single model?
    • Switched per time step, or per position (space step)?
  • For end-to-end models, PPL is the best automatic evaluation metric! (A quick illustration follows.)
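For reference, perplexity is just the exponentiated average negative log-likelihood per token; a tiny illustration with made-up log-probabilities (not Meena's numbers):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# The log-probabilities below are made-up numbers for illustration only.

def perplexity(token_logprobs):
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

print(perplexity([-1.2, -0.4, -2.0, -0.9]))  # ≈ exp(1.125) ≈ 3.08
```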

Advanced

Random sampling with temperature
notion image
Cross-turn repetition is solved by filtering out candidate responses that share long common sub-sequences with earlier turns; a hedged sketch follows.
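A hedged sketch of such a filter: drop a candidate response whose longest common word sub-sequence with any earlier turn is too long. The word-level matching and the threshold of 3 are assumptions, not the paper's exact rule.

```python
# Hedged sketch of a cross-turn repetition filter: drop a candidate response
# if it shares a long common word sub-sequence with any earlier turn.
# The word-level matching and the min_overlap threshold are assumptions.

def longest_common_subsequence_len(a, b):
    """Classic dynamic-programming LCS length over word lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def repeats_earlier_turn(candidate, history, min_overlap=3):
    cand_words = candidate.lower().split()
    return any(
        longest_common_subsequence_len(cand_words, turn.lower().split()) >= min_overlap
        for turn in history
    )

# Example: the candidate echoes an earlier turn, so it would be filtered out.
print(repeats_earlier_turn("I love to play the piano", ["Do you love to play the piano?"]))  # True
```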