difficulty in understanding starspace_embedding() behavior #6

Open
fzhang612 opened this issue Nov 30, 2018 · 6 comments

@fzhang612

fzhang612 commented Nov 30, 2018

I am trying to replicate the return value of the starspace_embedding() function. Here is what I have found so far.

When training a model with ngrams = 1, starspace_embedding(model, 'word1 word2') = as.matrix(model)['word1', ] + as.matrix(model)['word2', ], normalized accordingly. However, this doesn't hold when the model is trained with ngrams > 1.
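Concretely, the check looks like this (a minimal sketch; word1 and word2 stand for words in the model's dictionary, and the normalization shown assumes the default p = 0.5 discussed below):

## document embedding as returned by ruimtehol
emb <- starspace_embedding(model, "word1 word2")
embedding_dictionary <- as.matrix(model)
## manual reconstruction: sum the two word vectors, then divide by 2^0.5
manual <- colSums(embedding_dictionary[c("word1", "word2"), ]) / 2^0.5
all.equal(as.numeric(emb), as.numeric(manual))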

Thanks in advance.

@jwijffels
Contributor

jwijffels commented Nov 30, 2018

If you want embeddings of ngrams and your model is trained with ngram > 1, you should probably use starspace_embedding(model, 'word1 word2', type = "ngram").
The embeddings are also governed by the parameter p, which can be passed on to the starspace function and defaults to 0.5. From the Starspace docs:
-p normalization parameter: we normalize the sum of embeddings by dividing by Size^p; when p=1, it's equivalent to taking the average of the embeddings; when p=0, it's equivalent to taking the sum of the embeddings. [0.5]
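In other words, the document embedding is the sum of the word embeddings divided by Size^p. A small sketch of that normalization (doc_embedding is a hypothetical helper for illustration, not part of ruimtehol):

## hypothetical helper illustrating the Size^p normalization described above
doc_embedding <- function(embedding_dictionary, words, p = 0.5) {
  colSums(embedding_dictionary[words, , drop = FALSE]) / length(words)^p
}
## p = 1 gives the average, p = 0 the plain sum, p = 0.5 is the Starspace default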

@fzhang612
Author

Thanks for the response. However, I am now confused about the difference between starspace_embedding(model, 'word1 word2', type = 'document') and starspace_embedding(model, 'word1 word2', type = 'ngram'). If the latter is the embedding of the bigram word1_word2 when trained with ngrams = 2, what does the former represent and how is it calculated? Thanks.

@jwijffels
Contributor

jwijffels commented Dec 1, 2018

Did you check by dividing your embedding summation by Size^p as I indicated? Size is 2 in your case, as you have 2 words, and p is 0.5 by default. That is what you get if you specify type = "document". If you specify type = "ngram", Starspace uses the hashing trick from fastText to find out in which bucket the ngram lies and then retrieves the embedding of that bucket. You can inspect the C++ code for that.
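For illustration, the two calls side by side (a sketch; 'word1 word2' stands for any two words in the dictionary of a model trained with ngram > 1):

## document embedding of the full input, normalized by Size^p
starspace_embedding(model, "word1 word2", type = "document")
## embedding of the bigram's hashed bucket (requires a model trained with ngram > 1)
starspace_embedding(model, "word1 word2", type = "ngram")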

@fzhang612
Author

Yes, I did divide the embedding summation by Size^p. Let me rephrase my question in a clearer way.

If the model is trained with similarity = 'dot', p = 0.5, ngrams = 1, then the following holds:
starspace_embedding(model, 'word_1 word_2', type = 'document') = (as.matrix(model)['word_1', ] + as.matrix(model)['word_2', ]) / sqrt(2)

However, if the model is trained with ngrams = 2, keeping all other parameters the same, the above equation doesn't hold.

What am I missing about the difference between the ngrams = 1 model and the ngrams = 2 model?

Thanks

@jwijffels
Contributor

jwijffels commented Dec 2, 2018

So in short: for the unigram words the embeddings are retrieved directly, and for the bigram (if the model was trained with ngram > 1) the embedding of the corresponding hashed bucket is retrieved.
That is, Starspace only stores embeddings of single words, not of bigrams. Bigrams or ngrams are represented as hashed combinations of the words that make up the ngram.
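A conceptual sketch of that hashing idea (this is not Starspace's actual hash function or bucket count, only an illustration of mapping an ngram to a bucket index):

## toy hash: not the fastText/Starspace hash, purely illustrative
toy_hash <- function(x) {
  codes <- utf8ToInt(x)
  sum(codes * seq_along(codes))
}
n_buckets <- 100000                      # hypothetical number of buckets
toy_hash("federale_politie") %% n_buckets
## the bigram's embedding lives at a bucket index like this,
## not under its own row name in as.matrix(model)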

@jwijffels
Contributor

jwijffels commented Dec 2, 2018

Nevertheless, these hashed combinations cannot be reproduced from R without touching some C++ code. I tried the following experiments; in the last one I hoped that taking the ngram embedding of the bigram and adding it to the embeddings of the unigrams would give the same result as the document embedding, but apparently that is not what happens. Maybe it would be nice to ask the Starspace authors themselves.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
dekamer$text <- gsub("\\.([[:digit:]]+)\\.", ". \\1.", x = dekamer$question)
dekamer$text <- strsplit(dekamer$text, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

## ngram = 1, dot similarity, p = 0.5: the document embedding should match
## the summed word vectors divided by 2^0.5
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 1, p = 0.5,
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
embedding
## manual reconstruction
colSums(embedding_dictionary[c("federale", "politie"), ]) / 2^0.5

## ngram = 1, cosine similarity: the document embedding should match
## the summed word vectors scaled to unit Euclidean norm
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "cosine",
                        early_stopping = 0.8, ngram = 1, 
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
euclidean_norm <- function(x) sqrt(sum(x^2))
manual <- colSums(embedding_dictionary[c("federale", "politie"), ])
manual / euclidean_norm(manual)
embedding

## does not work as expected
## it would make sense to ask the Starspace authors about this
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 2, p = 0,
                        dim = 10, minCount = 5)
emb_doc <- starspace_embedding(model, "federale politie", type = "document")
emb_ngram <- starspace_embedding(model, "federale politie", type = "ngram")
embedding_dictionary <- as.matrix(model)
emb_doc
## unigram vectors plus the bigram's hashed-bucket vector (ngram embedding)
manual <- rbind(embedding_dictionary[c("federale", "politie"), ], 
                emb_ngram)
colSums(manual)
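To make that last comparison explicit rather than eyeballing the two printouts, one could add the following check on the same objects as above (it shows the two vectors differ):

## explicit check of the hoped-for equality from the last experiment
all.equal(as.numeric(emb_doc), as.numeric(colSums(manual)))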
