DL
Title: GloVe: Global Vectors for Word Representation
Keywords: semantic, syntactic, matrix factorization, window-based
Concept
Learning vector space representations of words → prior methods already show that meaning can be captured so that vector arithmetic works (e.g. king − man + woman ≈ queen)
1) Matrix Factorization Method
rows: words (terms); columns: documents (LSA-style) or other context words (HAL-style)
Advantage] leverages global corpus statistics; in HAL-style matrices, raw co-occurrence counts can be normalized (e.g. COALS) so they are compressed into a smaller interval with a more even distribution
Problem] the most frequent words (e.g. "the", "and") contribute a disproportionate amount to similarity measures
2) Shallow Window-Based Methods
word2vec models (skip-gram & continuous bag-of-words, CBOW): single-layer architectures based on the inner product between two word vectors
Advantage] in the word analogy task, linguistic patterns are easy to learn as linear relationships between word vectors
Problem] do NOT operate directly on the global co-occurrence statistics (they scan local context windows instead of using counts from the whole corpus!)
⇒ GloVe: 1) global matrix factorization (LSA-style) + 2) local context window methods (skip-gram, Mikolov et al.)
→ tries to combine the advantages of both!
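A toy sketch of the vector-arithmetic property mentioned above. The 3-d vectors here are made up for illustration (not trained embeddings); with them, king − man + woman lands exactly on queen.

```python
import numpy as np

# Made-up 3-d vectors (NOT trained embeddings), chosen so that the analogy
# query king - man + woman coincides with the vector for "queen".
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.8]),
    "queen": np.array([0.8, 0.6, 0.8]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = emb["king"] - emb["man"] + emb["woman"]
# Rank all candidate words (excluding the query words) by cosine similarity.
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(query, emb[w]))
print(best)  # -> queen (query equals emb["queen"] exactly for these toy vectors)
```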
Progress
P_ij = P(j | i) = X_ij / X_i : probability that word j appears in the context of word i (X_ij: co-occurrence count, X_i = Σ_k X_ik)
→ Def.) P_ik / P_jk : ratio of the co-occurrence probabilities of words i and j with a probe word k
if P_ik / P_jk ≈ 1: word k is related to both i and j, or to neither
else: word k is related to only one of them (ratio >> 1 → related to i; ratio << 1 → related to j)
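The discriminating behavior of the ratio can be checked on toy counts. The numbers below are invented (loosely mimicking the paper's ice/steam illustration), not taken from any corpus.

```python
# Toy co-occurrence counts (made-up numbers) showing how P_ik / P_jk behaves.
X = {  # X[i][k]: co-occurrence count of target word i with context word k
    "ice":   {"solid": 190, "gas": 7,   "water": 300},
    "steam": {"solid": 5,   "gas": 180, "water": 290},
}

def P(i, k):
    Xi = sum(X[i].values())   # X_i = sum_k X_ik
    return X[i][k] / Xi       # P_ik = X_ik / X_i

ratios = {}
for k in ["solid", "gas", "water"]:
    ratios[k] = P("ice", k) / P("steam", k)
    print(k, round(ratios[k], 2))
# "solid" -> ratio >> 1 (related to ice), "gas" -> ratio << 1 (related to
# steam), "water" -> ratio near 1 (related to both)
```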
the co-occurrence ratio depends on i, j, k: F(w_i, w_j, w̃_k) = P_ik / P_jk
domain should be a vector space; encode the difference of word vectors and use the inner product: F((w_i − w_j)^T w̃_k) = P_ik / P_jk
F should be a homomorphism between (ℝ, +) and (ℝ>0, ×): F((w_i − w_j)^T w̃_k) = F(w_i^T w̃_k) / F(w_j^T w̃_k)
combine eq. 2) and 3): F(w_i^T w̃_k) = P_ik = X_ik / X_i
if F is exponential: w_i^T w̃_k = log P_ik = log X_ik − log X_i
log X_i is independent of k, so think of it as a bias b_i (add b̃_k for symmetry): w_i^T w̃_k + b_i + b̃_k = log X_ik
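The homomorphism step can be sanity-checked numerically: with F = exp, the required identity holds for any vectors, because exp(a − b) = exp(a) / exp(b). A quick check with random vectors:

```python
import numpy as np

# Numeric sanity check of the homomorphism property for F = exp:
# F((w_i - w_j)^T w_k) == F(w_i^T w_k) / F(w_j^T w_k).
rng = np.random.default_rng(0)
wi, wj, wk = rng.normal(size=(3, 5))  # three random 5-d vectors

lhs = np.exp((wi - wj) @ wk)
rhs = np.exp(wi @ wk) / np.exp(wj @ wk)
print(np.isclose(lhs, rhs))  # -> True
```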
In the last equation, if X_ik = 0, log X_ik would be infinite. → So, apply a weighting function f(X_ij) inside the cost function:
J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2
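A minimal sketch of evaluating that weighted least-squares cost. The toy matrix X and the random parameters are made up; the weighting function is passed in as a plug-in argument (here a constant 1, i.e. unweighted).

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, f):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    loss = 0.0
    n, m = X.shape
    for i in range(n):
        for j in range(m):
            if X[i, j] > 0:  # f(0) = 0, so zero-count pairs drop out of the sum
                err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                loss += f(X[i, j]) * err ** 2
    return loss

# Made-up toy data: 4 words, 3-d vectors, random counts in [0, 10).
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(4, 4)).astype(float)
W, Wt = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, bt = rng.normal(size=4), rng.normal(size=4)
loss_value = glove_loss(W, Wt, b, bt, X, f=lambda x: 1.0)  # f = 1: unweighted
print(loss_value)
```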
for the weight func.: 1) f(0) = 0 (for continuity), 2) non-decreasing func. (so rare co-occurrences are not overweighted), 3) relatively small for large x (so frequent co-occurrences are not overweighted) → f(x) = (x / x_max)^α if x < x_max, else 1 (paper: x_max = 100, α = 3/4)
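The paper's weighting function, with its reported hyperparameters, satisfies all three properties above:

```python
# Weighting function from the paper: f(x) = (x / x_max)^alpha for x < x_max,
# else 1, with the reported values x_max = 100, alpha = 3/4.
def f(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

print(f(0), f(10), f(100), f(10_000))
# f(0) = 0; f is non-decreasing; f is capped at 1 for very frequent pairs
```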
Result
Complexity: O(|X|), the number of nonzero entries of the co-occurrence matrix; under the paper's modeling assumptions this is O(|C|^0.8), better than the O(|C|) scaling of on-line window-based methods
Evaluation
1) Word Analogy: predict the word d in "a is to b as c is to ?" (questions classified as semantic or syntactic)
2) Word Similarity: datasets: WordSim-353, MC, RG
3) NER(Named Entity Recognition)
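The word-similarity protocol ranks word pairs by the model's cosine similarity and compares against human ratings via Spearman rank correlation. A minimal sketch with made-up scores (Spearman implemented by hand for the no-ties case):

```python
# Sketch of the word-similarity evaluation: Spearman correlation between
# human pair ratings and model cosine similarities. All numbers are made up.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    # no-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [9.0, 7.5, 3.0, 1.0]   # hypothetical WordSim-style pair ratings
model = [0.9, 0.8, 0.2, 0.1]   # hypothetical model cosine similarities
rho = spearman(human, model)
print(rho)  # identical ordering of pairs -> 1.0
```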
Tokenization → lowercase → choose the top 400,000 most frequent words to build the co-occurrence count matrix (with a decreasing weighting function: a pair at distance d contributes 1/d) → 50 iterations for dim < 300, else 100 iterations && window range: 10 words on each side
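The co-occurrence construction above can be sketched as follows. Vocabulary selection (top 400,000 words) is omitted for brevity; the matrix is simply keyed by word pairs, and a pair at distance d contributes 1/d.

```python
from collections import defaultdict

# Sketch of building the co-occurrence matrix with a symmetric window and the
# decreasing weighting function (1/d per pair at distance d). Vocabulary
# filtering is omitted; X is a dict keyed by (word, context_word).
def cooccurrence(tokens, window=10):
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            d = i - j                      # distance between the two tokens
            X[(w, tokens[j])] += 1.0 / d   # decreasing weight with distance
            X[(tokens[j], w)] += 1.0 / d   # symmetric context window
    return X

X = cooccurrence("the cat sat on the mat".split(), window=2)
print(X[("cat", "sat")])  # adjacent pair, distance 1 -> 1.0
```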
Opinion
- For the weighting function: how would performance differ between the paper's function and a half-softmax-style function?
- For the biases b_i, b̃_k: how should these be set? A fixed value, or individual values depending on some condition?
Advanced