📦 DL
Towards a Human-like Open-Domain Chatbot
date
May 25, 2023
slug
nlp-3
author
status
Public
tags
NLP
Chatbot
summary
This paper contributes (1) a simple human evaluation metric (SSA), (2) evidence of a strong correlation between perplexity and SSA, and (3) an end-to-end neural model whose low perplexity yields the best chatbot performance to date.
type
Post
category
📦 DL
updatedAt
May 25, 2023 08:56 AM
Title: Towards a Human-like Open-Domain Chatbot
Keyword: Chatbot, end-to-end model, perplexity-SSA, interactive model, sample-and-rank, random sampling with temperature, cherry-picked, manual coordinate-descent search
Concept
An open-domain chatbot can hold a conversation on any topic; prior approaches include knowledge-based, retrieval-based, and rule-based systems.
Meena
- trained end-to-end on 40B words
- seq2seq model + Evolved Transformer
- trained on multi-turn conversations: context → response pairs
SSA: a human-evaluation metric collected in both static and interactive settings.
This paper contributes (1) a simple human evaluation metric (SSA), (2) evidence of a strong correlation between perplexity and SSA, and (3) an end-to-end neural model whose low perplexity yields the best chatbot performance to date.
Progress
Evaluating chatbots
[1] basic human judgement (sensibleness: is the response consistent and sensible?), static evaluation
MTB (Mini-Turing Benchmark): a fixed collection of contexts with 1–3 turns each
[2] subjective human judgement (specificity), interactive evaluation
requires 14–28 turns per conversation, always starting with ‘Hi!’
can also probe quality, fluency, diversity, relatedness, and empathy
Combining the [1] and [2] labels gives the metric called SSA (Sensibleness and Specificity Average).
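Concretely, SSA is just the arithmetic mean of the two label rates; a tiny illustration (the numbers are made up):

```python
# SSA = average of sensibleness and specificity rates (illustrative values).
sensibleness = 0.87   # fraction of responses labeled "makes sense"
specificity  = 0.72   # fraction of responses labeled "specific to context"
ssa = (sensibleness + specificity) / 2
print(ssa)  # 0.795
```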
Meena chatbot: developed by Google
[1] Training Data (social media conversation)
Source data are structured as “message trees” [341 GB of text, 40B words]
Filtering: keep messages with 2–128 subwords, ≥70% alphabetic characters, no URLs, author username not containing ‘bot’, and not repeated more than 100 times (a sketch of these filters appears after the tokenization note below)
Tokenization: BPE (byte-pair encoding) with an 8K subword vocabulary
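A minimal Python sketch of these filters; the function name and inputs are my own, and `num_subwords` / `global_count` are assumed to be computed upstream:

```python
import re

def keep_message(text: str, author: str, num_subwords: int, global_count: int) -> bool:
    """Apply the paper's message filters (thresholds from the paper)."""
    if not (2 <= num_subwords <= 128):       # too short or too long
        return False
    alphabetic = sum(c.isalpha() for c in text)
    if alphabetic < 0.7 * len(text):         # under 70% alphabetic characters
        return False
    if re.search(r"https?://", text):        # contains a URL
        return False
    if "bot" in author.lower():              # author username contains 'bot'
        return False
    if global_count > 100:                   # exact copy repeated > 100 times
        return False
    return True
```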
[2] Model Architecture
1 ET (Evolved Transformer) encoder block + 13 ET decoder blocks, hidden size 2560, 32 attention heads
The hyperparameters were found via manual coordinate-descent search.
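For reference, the reported shape can be captured in a small config sketch (the dataclass and its field names are mine, not from any released codebase):

```python
from dataclasses import dataclass

@dataclass
class MeenaConfig:
    encoder_blocks: int = 1      # Evolved Transformer encoder blocks
    decoder_blocks: int = 13     # Evolved Transformer decoder blocks
    hidden_size: int = 2560
    attention_heads: int = 32
    vocab_size: int = 8192       # ~8K BPE subwords
```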
[3] Decoding
Typical problem: unspecific, bland responses → often attacked with more sophisticated decoding algorithms; Meena instead uses simple sample-and-rank.
Sample N independent responses by plain random sampling with temperature T (T rescales the logits: lower T widens the gap between the most and least likely tokens, higher T narrows it); chosen N = 20, T = 0.88 (sketched below).
If T > 1, too much probability mass goes to contextually inappropriate tokens;
if T < 1, responses are safer but lean on common words such as prepositions.
** Beam search yields overly common, boring answers (similar effect to a low T).
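A runnable sketch of sample-and-rank with temperature; `logits_fn` is a hypothetical stand-in for the decoder's next-token logits, not the paper's interface:

```python
import numpy as np

def sample_and_rank(logits_fn, context, N=20, T=0.88, max_len=32, eos=1, seed=0):
    """Draw N independent responses with temperature sampling, then return
    the candidate with the highest (untempered) log-probability."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(N):
        response, score = [], 0.0
        for _ in range(max_len):
            logits = np.asarray(logits_fn(context, response), dtype=float)
            # Temperature sampling: divide logits by T before the softmax.
            p = np.exp(logits / T - np.max(logits / T))
            p /= p.sum()
            tok = int(rng.choice(len(p), p=p))
            # Rank candidates by the model's own (T = 1) log-likelihood.
            logp = logits - np.max(logits)
            logp -= np.log(np.exp(logp).sum())
            score += logp[tok]
            response.append(tok)
            if tok == eos:
                break
        if score > best_score:
            best, best_score = response, score
    return best
```

With T < 1 the softmax sharpens (safer, more generic tokens); with T > 1 it flattens, which is what pushes probability onto inappropriate tokens.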
[4] Other strategy: when cherry-picked conversations are shown, they are selected only after the conversations have been completed.
Result
- Focus on sensibleness and specificity for crowd workers’ evaluation (SSA)
- Best SSA of any chatbot to date (as of 2019)
- BUT results may be biased by the first turn’s context (every interactive conversation starts with ‘Hi!’).
- It would be better to also evaluate directions such as humor, empathy, and knowledge skills.
Opinion
- Would the correlation between PPL and SSA still hold for an advanced chatbot with long-term memory?
- Is ‘random sampling with temperature’ also the best decoding choice for logical conversations (factual knowledge, math, etc.)?
- Cons) Such conversations need one specific answer; whether it is common or uncommon does not matter. → Beam search or greedy decoding could be more useful.
- Do we then need to classify the conversation’s topic first? → more hyperparameters
- How can multiple sampling methods be used in a single model?
- Switched at each time step, or across the whole sequence?
- For end-to-end models, PPL is the best automatic evaluation metric!
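For reference, perplexity is the exponentiated average negative log-likelihood of held-out tokens; a minimal sketch:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp(-mean log p(token)); lower = model is less 'surprised'."""
    return float(np.exp(-np.mean(token_log_probs)))

# A model assigning each token probability 0.1 has PPL = 10;
# the paper reports a test PPL of about 10.2 for the best Meena model.
print(perplexity(np.log([0.1, 0.1, 0.1])))  # ≈ 10.0
```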
Advanced
Perplexity: https://supkoon.tistory.com/41
Random sampling with temperature
Solving cross-turn repetition: filter out candidate responses that share common sub-sequences with earlier turns (a small sketch appears at the end of this section).
Other Sampling Method: https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc
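A small sketch of the cross-turn repetition filter mentioned above; the substring length threshold is my guess, not the paper's exact rule:

```python
def has_cross_turn_repetition(candidate: str, prior_turns, min_len: int = 10) -> bool:
    """Flag a candidate that shares a long common substring with any earlier turn."""
    for turn in prior_turns:
        for i in range(len(candidate) - min_len + 1):
            if candidate[i:i + min_len] in turn:
                return True
    return False

# Candidates flagged here would be removed before sample-and-rank picks a winner.
```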