DL
Title: GloVe: Global Vectors for Word Representation
Keywords: semantic, syntactic, matrix factorization, window-based
Concept
Learning vector space representations of words → prior methods already show that meaning can be captured so that vector arithmetic works (e.g. king − man + woman ≈ queen)
1) Matrix Factorization Method
rows: words (terms); columns: documents (LSA-style) or other context words (HAL-style)
Advantage] leverages global corpus statistics; in HAL-style matrices, raw co-occurrence counts can be normalized (e.g. COALS) so they are compressed into a smaller interval with a more even distribution
Problem] the most frequent words (e.g. "the", "and") contribute a disproportionate amount to similarity measures
2) Shallow Window-Based Methods
word2vec models (skip-gram & continuous bag-of-words, CBOW): single-layer architectures based on the inner product between two word vectors
Advantage] in the word analogy task, linguistic patterns are easy to learn as linear relationships between word vectors
Problem] do NOT operate directly on the global co-occurrence statistics (they scan local context windows instead of using counts from the whole corpus!)
⇒ GloVe: 1) global matrix factorization (LSA-style) + 2) local context window methods (skip-gram, Mikolov et al.)
→ tries to combine the advantages of both!
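A toy sketch of the vector-arithmetic property mentioned above. The 3-d vectors here are made up for illustration (not trained embeddings); with them, king − man + woman lands exactly on queen.

```python
import numpy as np

# Made-up 3-d vectors (NOT trained embeddings), chosen so that the analogy
# query king - man + woman coincides with the vector for "queen".
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.8]),
    "queen": np.array([0.8, 0.6, 0.8]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = emb["king"] - emb["man"] + emb["woman"]
# Rank all candidate words (excluding the query words) by cosine similarity.
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(query, emb[w]))
print(best)  # -> queen (query equals emb["queen"] exactly for these toy vectors)
```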
Progress
P_ij = P(j | i) = X_ij / X_i : probability that word j appears in the context of word i (X_ij: co-occurrence count, X_i = Σ_k X_ik)
→ Def.) P_ik / P_jk : ratio of the co-occurrence probabilities of words i and j with a probe word k
if P_ik / P_jk ≈ 1: word k is related to both i and j, or to neither
else: word k is related to only one of them (ratio >> 1 → related to i; ratio << 1 → related to j)
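The discriminating behavior of the ratio can be checked on toy counts. The numbers below are invented (loosely mimicking the paper's ice/steam illustration), not taken from any corpus.

```python
# Toy co-occurrence counts (made-up numbers) showing how P_ik / P_jk behaves.
X = {  # X[i][k]: co-occurrence count of target word i with context word k
    "ice":   {"solid": 190, "gas": 7,   "water": 300},
    "steam": {"solid": 5,   "gas": 180, "water": 290},
}

def P(i, k):
    Xi = sum(X[i].values())   # X_i = sum_k X_ik
    return X[i][k] / Xi       # P_ik = X_ik / X_i

ratios = {}
for k in ["solid", "gas", "water"]:
    ratios[k] = P("ice", k) / P("steam", k)
    print(k, round(ratios[k], 2))
# "solid" -> ratio >> 1 (related to ice), "gas" -> ratio << 1 (related to
# steam), "water" -> ratio near 1 (related to both)
```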
the co-occurrence ratio depends on i, j, k: F(w_i, w_j, w̃_k) = P_ik / P_jk
domain should be a vector space; encode the difference of word vectors and use the inner product: F((w_i − w_j)^T w̃_k) = P_ik / P_jk
F should be a homomorphism between (ℝ, +) and (ℝ>0, ×): F((w_i − w_j)^T w̃_k) = F(w_i^T w̃_k) / F(w_j^T w̃_k)
combine eq. 2) and 3): F(w_i^T w̃_k) = P_ik = X_ik / X_i
if F is exponential: w_i^T w̃_k = log P_ik = log X_ik − log X_i
log X_i is independent of k, so think of it as a bias b_i (add b̃_k for symmetry): w_i^T w̃_k + b_i + b̃_k = log X_ik
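The homomorphism step can be sanity-checked numerically: with F = exp, the required identity holds for any vectors, because exp(a − b) = exp(a) / exp(b). A quick check with random vectors:

```python
import numpy as np

# Numeric sanity check of the homomorphism property for F = exp:
# F((w_i - w_j)^T w_k) == F(w_i^T w_k) / F(w_j^T w_k).
rng = np.random.default_rng(0)
wi, wj, wk = rng.normal(size=(3, 5))  # three random 5-d vectors

lhs = np.exp((wi - wj) @ wk)
rhs = np.exp(wi @ wk) / np.exp(wj @ wk)
print(np.isclose(lhs, rhs))  # -> True
```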
In the last equation, if X_ik = 0, log X_ik would be infinite. → So, apply a weighting function f(X_ij) inside the cost function:
J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2
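A minimal sketch of evaluating that weighted least-squares cost. The toy matrix X and the random parameters are made up; the weighting function is passed in as a plug-in argument (here a constant 1, i.e. unweighted).

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, f):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    loss = 0.0
    n, m = X.shape
    for i in range(n):
        for j in range(m):
            if X[i, j] > 0:  # f(0) = 0, so zero-count pairs drop out of the sum
                err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                loss += f(X[i, j]) * err ** 2
    return loss

# Made-up toy data: 4 words, 3-d vectors, random counts in [0, 10).
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(4, 4)).astype(float)
W, Wt = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, bt = rng.normal(size=4), rng.normal(size=4)
loss_value = glove_loss(W, Wt, b, bt, X, f=lambda x: 1.0)  # f = 1: unweighted
print(loss_value)
```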
for the weight func.: 1) f(0) = 0 (for continuity), 2) non-decreasing func. (so rare co-occurrences are not overweighted), 3) relatively small for large x (so frequent co-occurrences are not overweighted) → f(x) = (x / x_max)^α if x < x_max, else 1 (paper: x_max = 100, α = 3/4)
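The paper's weighting function, with its reported hyperparameters, satisfies all three properties above:

```python
# Weighting function from the paper: f(x) = (x / x_max)^alpha for x < x_max,
# else 1, with the reported values x_max = 100, alpha = 3/4.
def f(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

print(f(0), f(10), f(100), f(10_000))
# f(0) = 0; f is non-decreasing; f is capped at 1 for very frequent pairs
```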
Result
Complexity: O(|X|), the number of nonzero entries of the co-occurrence matrix; under the paper's modeling assumptions this is O(|C|^0.8), better than the O(|C|) scaling of on-line window-based methods
Evaluation
1) Word Analogy: predict the word d in "a is to b as c is to ?" (questions classified as semantic or syntactic)
2) Word Similarity: datasets: WordSim-353, MC, RG
3) NER(Named Entity Recognition)
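The word-similarity protocol ranks word pairs by the model's cosine similarity and compares against human ratings via Spearman rank correlation. A minimal sketch with made-up scores (Spearman implemented by hand for the no-ties case):

```python
# Sketch of the word-similarity evaluation: Spearman correlation between
# human pair ratings and model cosine similarities. All numbers are made up.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    # no-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [9.0, 7.5, 3.0, 1.0]   # hypothetical WordSim-style pair ratings
model = [0.9, 0.8, 0.2, 0.1]   # hypothetical model cosine similarities
rho = spearman(human, model)
print(rho)  # identical ordering of pairs -> 1.0
```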
Tokenization → lowercase → choose the top 400,000 most frequent words to build the co-occurrence count matrix (with a decreasing weighting function: a pair at distance d contributes 1/d) → 50 iterations for dim < 300, else 100 iterations && window range: 10 words on each side
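The co-occurrence construction above can be sketched as follows. Vocabulary selection (top 400,000 words) is omitted for brevity; the matrix is simply keyed by word pairs, and a pair at distance d contributes 1/d.

```python
from collections import defaultdict

# Sketch of building the co-occurrence matrix with a symmetric window and the
# decreasing weighting function (1/d per pair at distance d). Vocabulary
# filtering is omitted; X is a dict keyed by (word, context_word).
def cooccurrence(tokens, window=10):
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            d = i - j                      # distance between the two tokens
            X[(w, tokens[j])] += 1.0 / d   # decreasing weight with distance
            X[(tokens[j], w)] += 1.0 / d   # symmetric context window
    return X

X = cooccurrence("the cat sat on the mat".split(), window=2)
print(X[("cat", "sat")])  # adjacent pair, distance 1 -> 1.0
```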
Opinion
- For the weighting function: how would performance differ between the paper's function and a half-softmax-style function?
- For the biases b_i, b̃_k: how should these be set? A fixed value, or individual values depending on some condition?
Advanced