difficulty in understanding starspace_embedding() behavior #6

Open
fzhang612 opened this issue Nov 30, 2018 · 6 comments

@fzhang612

fzhang612 commented Nov 30, 2018

I am trying to replicate the return value of the starspace_embedding() function. Here is what I have found so far.

When training a model with ngrams = 1, starspace_embedding(model, 'word1 word2') = as.matrix(model)['word1', ] + as.matrix(model)['word2', ], normalized accordingly. However, this doesn't hold when the model is trained with ngrams > 1.
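Concretely, the check looks like this (a minimal sketch; word1 and word2 stand for words in the model's dictionary, and the normalization shown assumes the default p = 0.5 discussed below):

## document embedding as returned by ruimtehol
emb <- starspace_embedding(model, "word1 word2")
embedding_dictionary <- as.matrix(model)
## manual reconstruction: sum the two word vectors, then divide by 2^0.5
manual <- colSums(embedding_dictionary[c("word1", "word2"), ]) / 2^0.5
all.equal(as.numeric(emb), as.numeric(manual))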

Thanks in advance.

@jwijffels
Contributor

jwijffels commented Nov 30, 2018

If you want embeddings of ngrams and your model is trained with ngram > 1, you should probably use starspace_embedding(model, 'word1 word2', type = "ngram").
The embeddings are also governed by the parameter p, which can be passed on to the starspace function and defaults to 0.5. From the Starspace docs:
-p normalization parameter: we normalize the sum of embeddings by dividing by Size^p; when p=1, it's equivalent to taking the average of the embeddings; when p=0, it's equivalent to taking the sum of the embeddings. [0.5]
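In other words, the document embedding is the sum of the word embeddings divided by Size^p. A small sketch of that normalization (doc_embedding is a hypothetical helper for illustration, not part of ruimtehol):

## hypothetical helper illustrating the Size^p normalization described above
doc_embedding <- function(embedding_dictionary, words, p = 0.5) {
  colSums(embedding_dictionary[words, , drop = FALSE]) / length(words)^p
}
## p = 1 gives the average, p = 0 the plain sum, p = 0.5 is the Starspace default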

@fzhang612
Author

Thanks for the response. However, I am now confused about the difference between starspace_embedding(model, 'word1 word2', type = 'document') and starspace_embedding(model, 'word1 word2', type = 'ngram'). If the latter is the embedding of the bigram word1_word2 when trained with ngrams = 2, what does the former represent and how is it calculated? Thanks.

@jwijffels
Contributor

jwijffels commented Dec 1, 2018

Did you check by dividing your embedding summation by Size^p as I indicated? Size is 2 in your case, as you have 2 words, and p is 0.5 by default. That is what you get if you specify type = "document". If you specify type = "ngram", Starspace uses the hashing trick from fastText to find out in which bucket the ngram lies and then retrieves the embedding of that bucket. You can inspect the C++ code for that.
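For illustration, the two calls side by side (a sketch; 'word1 word2' stands for any two words in the dictionary of a model trained with ngram > 1):

## document embedding of the full input, normalized by Size^p
starspace_embedding(model, "word1 word2", type = "document")
## embedding of the bigram's hashed bucket (requires a model trained with ngram > 1)
starspace_embedding(model, "word1 word2", type = "ngram")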

@fzhang612
Author

Yes, I did divide the embedding summation by Size^p. Let me rephrase my question in a clearer way.

If the model is trained with similarity = 'dot', p = 0.5, ngrams = 1, then the following holds:
starspace_embedding(model, 'word_1 word_2', type = 'document') = (as.matrix(model)['word_1', ] + as.matrix(model)['word_2', ]) / sqrt(2)

However, if the model is trained with ngrams = 2, keeping all other parameters the same, the above equation doesn't hold.

What am I missing about the difference between the ngrams = 1 model and the ngrams = 2 model?

Thanks

@jwijffels
Contributor

jwijffels commented Dec 2, 2018

So in short: for the unigram words the embeddings are retrieved directly, and for the bigram (if the model was trained with ngram > 1) the embedding of the corresponding hashed bucket is retrieved.
That is, Starspace only stores embeddings of single words, not of bigrams. Bigrams or ngrams are represented as hashed combinations of the words that make up the ngram.
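A conceptual sketch of that hashing idea (this is not Starspace's actual hash function or bucket count, only an illustration of mapping an ngram to a bucket index):

## toy hash: not the fastText/Starspace hash, purely illustrative
toy_hash <- function(x) {
  codes <- utf8ToInt(x)
  sum(codes * seq_along(codes))
}
n_buckets <- 100000                      # hypothetical number of buckets
toy_hash("federale_politie") %% n_buckets
## the bigram's embedding lives at a bucket index like this,
## not under its own row name in as.matrix(model)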

@jwijffels
Contributor

jwijffels commented Dec 2, 2018

Nevertheless, these hashed combinations cannot be reproduced from R without touching some C++ code. I tried the following experiments; in the last one I hoped that taking the ngram embedding of the bigram and adding it to the embeddings of the unigrams would give the same result as the document embedding, but apparently that is not what happens. Maybe it would be nice to ask the Starspace authors themselves.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
dekamer$text <- gsub("\\.([[:digit:]]+)\\.", ". \\1.", x = dekamer$question)
dekamer$text <- strsplit(dekamer$text, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

## ngram = 1, dot similarity, p = 0.5: the document embedding should match
## the summed word vectors divided by 2^0.5
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 1, p = 0.5,
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
embedding
## manual reconstruction
colSums(embedding_dictionary[c("federale", "politie"), ]) / 2^0.5

## ngram = 1, cosine similarity: the document embedding should match
## the summed word vectors scaled to unit Euclidean norm
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "cosine",
                        early_stopping = 0.8, ngram = 1, 
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
euclidean_norm <- function(x) sqrt(sum(x^2))
manual <- colSums(embedding_dictionary[c("federale", "politie"), ])
manual / euclidean_norm(manual)
embedding

## does not work as expected
## it would make sense to ask the Starspace authors about this
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 2, p = 0,
                        dim = 10, minCount = 5)
emb_doc <- starspace_embedding(model, "federale politie", type = "document")
emb_ngram <- starspace_embedding(model, "federale politie", type = "ngram")
embedding_dictionary <- as.matrix(model)
emb_doc
## unigram vectors plus the bigram's hashed-bucket vector (ngram embedding)
manual <- rbind(embedding_dictionary[c("federale", "politie"), ], 
                emb_ngram)
colSums(manual)
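To make that last comparison explicit rather than eyeballing the two printouts, one could add the following check on the same objects as above (it shows the two vectors differ):

## explicit check of the hoped-for equality from the last experiment
all.equal(as.numeric(emb_doc), as.numeric(colSums(manual)))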
