pclean.cc integration tests are failing #233

Open
emilyfertig opened this issue Oct 4, 2024 · 3 comments
@emilyfertig

The assertions in CleanRelation::logp_gibbs_exact* are failing, apparently due to roundoff error. When I comment out the assertions, it still crashes with "Warning: all Dirichlet hyperparameters give nans!"

Below is the log when I run

./bazel-bin/pclean/pclean --schema=assets/flights.schema --obs=assets/flights_dirty.10.csv --iters=5 --output=/tmp/flights.out

on the branch 100424-emilyaf-bigram-debug.

Setting seed to 10
Reading plcean schema ...
Reading schema file from assets/flights.schema
Making GenDB model ...
Reading observations ...
Reading observations file from assets/flights_dirty.10.csv
Incorporating observations ...
Schema does not contain tuple_id, skipping ...
Running inference ...
Starting outer iteration 1, model score = -5582309.321400
Starting iteration 1, model score = -5582187.998050
calling logp gibbs exact on act_arr_time
true is 9:32 a.m. noisy is 9:32 a.m.
calling logp gibbs exact on act_arr_time
true is 4:09 p.m. noisy is 4:09 p.m.
calling logp gibbs exact on act_arr_time
true is 9:28 a.m. noisy is 9:28 a.m.
calling logp gibbs exact on act_arr_time
true is 2:50 p.m. noisy is 2:50 p.m.
in relation act_arr_time_emission
oh no should be less than 2.24782e-11 but is 1
logp0 is -101233, logp score is -101234
pclean: clean_relation.hh:345: double CleanRelation<T>::logp_gibbs_exact_current(const std::vector<std::vector<int> >&) [with T = std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >]: Assertion `false' failed.
Aborted
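
A note on the threshold in the log: 2.24782e-11 is what you get by multiplying double-precision machine epsilon (about 2.22e-16) by |logp0| = 101233, so the check looks like a relative tolerance of roughly one epsilon. Below is a minimal sketch of that kind of comparison. It is not the code in clean_relation.hh; the name `logp_close` and the `ulps` parameter are hypothetical and only illustrate the idea.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper (not taken from clean_relation.hh): returns true when
// two log probabilities agree to within `ulps` machine epsilons, scaled by
// the larger magnitude of the two values.
bool logp_close(double a, double b, double ulps = 1.0) {
  const double eps = std::numeric_limits<double>::epsilon();
  const double scale = std::max(std::fabs(a), std::fabs(b));
  return std::fabs(a - b) <= ulps * eps * scale;
}
```

With the values printed above, `logp_close(-101233.0, -101234.0)` returns false, since the difference of 1 is far above the 2.24782e-11 threshold.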
@emilyfertig
Author

Maybe it isn't roundoff error: when I change the max length of the Time string from 40 to 30, the logp discrepancy gets larger:

Setting seed to 10
Reading plcean schema ...
Reading schema file from assets/flights.schema
Making GenDB model ...
Reading observations ...
Reading observations file from assets/flights_dirty.10.csv
Incorporating observations ...
Schema does not contain tuple_id, skipping ...
Running inference ...
Starting outer iteration 1, model score = -5781017.216540
Starting iteration 1, model score = -5780895.893191
calling logp gibbs exact on act_arr_time
true is 9:32 a.m. noisy is 9:32 a.m.
in relation sched_arr_time_emission
oh no should be less than 2.24782e-11 but is 18
logp0 is -101233, logp score is -101215

@emilyfertig
Author

I commented out the failing logp checks in CleanRelation to try to get some insight into why the Dirichlet hyperparameters were NaN, and it appears there are NaNs in the counts vector of the bigram insertions distribution.

Transitioning bigram
Transitioning bigram insertions
counts in dirichlet cat is 
-nan 0 0 -nan -nan 0 0 -nan 0 0 0 0 0 0 0 0 0 0 0 0 0 -nan 0 0 0 0 0 -nan 0 0 -nan 0 0 -nan 0 0 0 -nan 0 0 0 0 0 -nan 0 0 0 0 0 0 -nan 0 0 0 0 0 0 0 -nan 0 0 0 0 -nan 0 0 0 0 -nan 0 0 0 -nan 0 0 0 -nan 0 0 0 0 -nan 0 0 0 0 0 0 0 0 0 0 0 0 0 -nan 
Warning: all Dirichlet hyperparameters give nans!
pclean: distributions/dirichlet_categorical.cc:62: virtual void DirichletCategorical::transition_hyperparameters(std::mt19937*): Assertion `false' failed.
Aborted

@emilyfertig
Author

This looks weird to me: https://github.com/probcomp/hierarchical-irm/blob/master/cxx/emissions/bigram_string.cc#L138

  double total_prob = 0.0;
  for (auto& a : alignments) {
    a.cost = exp(a.cost);  // Turn all costs into non-log probabilities
    total_prob += a.cost;
  }

  for (const auto& a : alignments) {
    double w = weight * a.cost / total_prob;

We're adding up exp(a.cost) to get total_prob, but then we're weighting a.cost without the exp (and dividing by total_prob). I suspect it should be exp(a.cost) on the last line of the snippet too, but when I make that change just to try it, the code hangs.
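
One thing worth noting from the snippet: the first loop stores exp(a.cost) back into a.cost, so by the second loop a.cost may already hold the non-log probability. A separate possibility, offered as an assumption rather than a confirmed diagnosis, is that exp underflows to 0 when every log cost is very negative. In that case total_prob is 0 and each weight becomes 0/0, which is NaN and would be consistent with the NaN counts reported earlier in this thread. The sketch below normalizes in log space by subtracting the maximum first (the log-sum-exp trick). The function name `normalized_weights` is hypothetical, and it takes a plain vector of log costs rather than the repository's alignment type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of bigram_string.cc. Takes log probabilities
// and returns weights proportional to exp(log_cost) that sum to `weight`.
// Assumes at least one log cost is finite.
std::vector<double> normalized_weights(const std::vector<double>& log_costs,
                                       double weight) {
  assert(!log_costs.empty());
  const double max_log = *std::max_element(log_costs.begin(), log_costs.end());
  assert(std::isfinite(max_log));

  // After subtracting the max, the largest term is exp(0) = 1, so the sum
  // is at least 1 and cannot underflow to zero.
  double total = 0.0;
  for (double lc : log_costs) total += std::exp(lc - max_log);

  std::vector<double> weights;
  weights.reserve(log_costs.size());
  for (double lc : log_costs) {
    weights.push_back(weight * std::exp(lc - max_log) / total);
  }
  return weights;
}
```

For example, with log costs {-1000, -1001} both exp values are 0 in double precision, so dividing by their sum gives 0/0. The shifted version returns weights of about 0.731 and 0.269 for weight = 1.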
