Immutable searchspace (#412)
Fixes #371 by making `SubspaceDiscrete` stateless.

### Current Approach
* The `SubspaceDiscrete.metadata` attribute gets deprecated and the
responsibility of metadata handling is shifted to `Campaign`.
* The new mechanism is not yet final (see "Out of scope" below) but is
designed so that upcoming changes can be implemented in a non-breaking
manner. In particular:
* The metadata handling mechanism is redesigned such that the actual
metadata representation is completely hidden from the user, i.e. the
campaign manages the data in the form of private attributes. This
avoids further lock-in to `pandas` as our search space backend and
prepares for future search space improvements by abstracting away the
specific implementation details, enabling us to easily offer other
backends (polars, databases, etc.) in the future.
* The `allow_*` flags are not yet migrated to the `Campaign` class, but
the `AnnotatedSubspaceDiscrete` makes it possible to migrate them in a
follow-up PR (#423) without causing much friction.
* A new user-facing method `Campaign.toggle_discrete_candidates` now
allows convenient and dynamic control over the discrete candidate set,
avoiding any fiddling with the backend dataframes and index
manipulations. The positive effect can be seen in the much cleaner code
of the simulation package.
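
To make the new control flow concrete, here is a minimal pandas sketch of the dataframe-based toggling path — the `exp_rep`/`metadata` frames and the `toggle` helper are hypothetical stand-ins for the campaign's internals, not the actual `Campaign` implementation:

```python
import pandas as pd

# Hypothetical stand-ins for the campaign's internals: the discrete
# experimental representation and the metadata frame tracking its status.
exp_rep = pd.DataFrame(
    {"Temperature": [10, 10, 20, 20], "Solvent": ["A", "B", "A", "B"]}
)
metadata = pd.DataFrame(
    False, index=exp_rep.index, columns=["recommended", "measured", "excluded"]
)


def toggle(filter_df: pd.DataFrame, exclude: bool, complement: bool = False,
           dry_run: bool = False) -> pd.DataFrame:
    """Sketch of dataframe-based candidate toggling."""
    # Rows of `exp_rep` matching any row of the filter specification
    mask = exp_rep[list(filter_df.columns)].apply(tuple, axis=1).isin(
        filter_df.apply(tuple, axis=1)
    )
    if complement:
        mask = ~mask
    points = exp_rep[mask]
    if not dry_run:  # a dry run only extracts the subset
        metadata.loc[points.index, "excluded"] = exclude
    return points


# Exclude all configurations that use solvent "B"
toggled = toggle(pd.DataFrame({"Solvent": ["B"]}), exclude=True)
```

The actual method additionally accepts a collection of `DiscreteConstraint` objects and intersects their valid indices, as the diff shows.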

### Out of scope / (potentially) coming next
* Migration of the `allow_*` flags, making `Campaign` the unique place
where the concept of metadata exists, i.e. campaigns will be the only
stateful objects. A PR taking care of this should follow soon because
the current `get_candidates` signature of `SubspaceDiscrete` makes
little sense: it expects these flags in a context where metadata does
not exist.
* Once the flags are migrated, `AnnotatedSubspaceDiscrete` might become
obsolete, since the `Campaign` class could then filter down the space
itself before passing it to the recommender. This, however, requires an
efficient implementation that does not cause unnecessary dataframe
copies.
* Actually making the search space classes `frozen`. There are a few
other things that should be addressed at the same time (e.g. a general
cleanup of the classes).
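
On the efficiency point, one possible shape of copy-free filtering is to carry only a boolean mask (derived from a metadata column such as the `excluded` flag introduced in this PR) and materialize the filtered frame once, at the recommender boundary. This is an illustrative sketch, not the planned implementation:

```python
import pandas as pd

exp_rep = pd.DataFrame({"x": range(6)})
metadata = pd.DataFrame(
    False, index=exp_rep.index, columns=["recommended", "measured", "excluded"]
)
metadata.loc[[0, 3], "excluded"] = True

# Carry the cheap boolean mask through the call chain ...
mask = ~metadata["excluded"]

# ... and create the single filtered copy only when handing the
# candidates to the recommender.
candidates = exp_rep.loc[mask]
```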
AdrianSosic authored Nov 22, 2024
2 parents ca8b221 + a0bbc78 commit 98cb4ea
Showing 28 changed files with 433 additions and 258 deletions.
12 changes: 9 additions & 3 deletions CHANGELOG.md
@@ -10,6 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Example for a traditional mixture
- `add_noise_to_perturb_degenerate_rows` utility
- `benchmarks` subpackage for defining and running performance tests
- `Campaign.toggle_discrete_candidates` to dynamically in-/exclude discrete candidates
- `DiscreteConstraint.get_valid` to conveniently access valid candidates

### Changed
- `SubstanceParameter` encodings are now computed exclusively with the
@@ -25,6 +27,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Rare bug arising from degenerate `SubstanceParameter.comp_df` rows that caused
wrong number of recommendations being returned
- `ContinuousConstraint`s can now be used in single point precision
- Search spaces are now stateless, preventing unintended side effects that could lead to
incorrect candidate sets when reused in different optimization contexts

### Deprecations
- Passing a dataframe via the `data` argument to `Objective.transform` is no longer
@@ -37,6 +41,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `SubstanceEncoding` value `MORGAN_FP`. As a replacement, `ECFP` with 1024 bits and
radius of 4 can be used.
- `SubstanceEncoding` value `RDKIT`. As a replacement, `RDKIT2DDESCRIPTORS` can be used.
- The `metadata` attribute of `SubspaceDiscrete` no longer exists. Metadata is now
exclusively handled by the `Campaign` class.

## [0.11.3] - 2024-11-06
### Fixed
@@ -301,7 +307,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Simulation no longer fails for targets in `MATCH` mode
- `closest_element` now works for array-like input of all kinds
- Structuring concrete subclasses no longer requires providing an explicit `type` field
- `_target(s)` attributes of `Objectives` are now de-/serialized without leading
- `_target(s)` attributes of `Objectives` are now (de-)serialized without leading
underscore to support user-friendly serialization strings
- Telemetry does not execute any code if it was disabled
- Running simulations no longer alters the states of the global random number generators
@@ -439,7 +445,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `mypy` for targets and intervals
- Tests for code blocks in README and user guides
- `hypothesis` strategies and roundtrip tests for targets, intervals, and dataframes
- De-/serialization of target subclasses via base class
- (De-)serialization of target subclasses via base class
- Docs building check now part of CI
- Automatic formatting checks for code examples in documentation
- Deserialization of classes with classmethod constructors can now be customized
@@ -464,7 +470,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Use pydoclint as flake8 plugin and not as a stand-alone linter
- Margins in documentation for desktop and mobile version
- `Interval`s can now also be deserialized from a bounds iterable
- `SubspaceDiscrete` and `SubspaceContinuous` now have de-/serialization methods
- `SubspaceDiscrete` and `SubspaceContinuous` now have (de-)serialization methods

### Removed
- Conda install instructions and version badge
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ Besides functionality to perform a typical recommend-measure loop, BayBE's highl
- ⚙️ Custom surrogate models: Enhance your predictions through mechanistic understanding
- 📈 Comprehensive backtest, simulation and imputation utilities: Benchmark and find your best settings
- 📝 Fully typed and hypothesis-tested: Robust code base
- 🔄 All objects are fully de-/serializable: Useful for storing results in databases or use in wrappers like APIs
- 🔄 All objects are fully (de-)serializable: Useful for storing results in databases or use in wrappers like APIs


## ⚡ Quick Start
2 changes: 1 addition & 1 deletion baybe/acquisition/base.py
@@ -159,7 +159,7 @@ def _get_botorch_acqf_class(
)


# Register de-/serialization hooks
# Register (un-)structure hooks
def _add_deprecation_hook(hook):
"""Add deprecation warnings to the default hook.
142 changes: 132 additions & 10 deletions baybe/campaign.py
@@ -4,23 +4,27 @@

import gc
import json
from collections.abc import Collection
from functools import reduce
from typing import TYPE_CHECKING

import cattrs
import numpy as np
import pandas as pd
from attrs import define, field
from attrs import define, evolve, field
from attrs.converters import optional
from attrs.validators import instance_of
from typing_extensions import override

from baybe.constraints.base import DiscreteConstraint
from baybe.exceptions import IncompatibilityError
from baybe.objectives.base import Objective, to_objective
from baybe.parameters.base import Parameter
from baybe.recommenders.base import RecommenderProtocol
from baybe.recommenders.meta.base import MetaRecommender
from baybe.recommenders.meta.sequential import TwoPhaseMetaRecommender
from baybe.recommenders.pure.bayesian.base import BayesianRecommender
from baybe.searchspace._annotated import AnnotatedSubspaceDiscrete
from baybe.searchspace.core import (
SearchSpace,
SearchSpaceType,
@@ -35,12 +39,20 @@
telemetry_record_recommended_measurement_percentage,
telemetry_record_value,
)
from baybe.utils.basic import is_all_instance
from baybe.utils.boolean import eq_dataframe
from baybe.utils.dataframe import filter_df, fuzzy_row_match
from baybe.utils.plotting import to_string

if TYPE_CHECKING:
from botorch.posteriors import Posterior

# Metadata columns
_RECOMMENDED = "recommended"
_MEASURED = "measured"
_EXCLUDED = "excluded"
_METADATA_COLUMNS = [_RECOMMENDED, _MEASURED, _EXCLUDED]


@define
class Campaign(SerialMixin):
@@ -77,6 +89,9 @@ class Campaign(SerialMixin):
"""The employed recommender"""

# Metadata
_searchspace_metadata: pd.DataFrame = field(init=False, eq=eq_dataframe)
"""Metadata tracking the experimentation status of the search space."""

n_batches_done: int = field(default=0, init=False)
"""The number of already processed batches."""

@@ -94,11 +109,44 @@
)
"""The cached recommendations."""

@_searchspace_metadata.default
def _default_searchspace_metadata(self) -> pd.DataFrame:
"""Create a fresh metadata object."""
df = pd.DataFrame(
False,
index=self.searchspace.discrete.exp_rep.index,
columns=_METADATA_COLUMNS,
)
df.loc[:, _EXCLUDED] = self.searchspace.discrete._excluded
return df

@override
def __str__(self) -> str:
recommended_count = sum(self._searchspace_metadata[_RECOMMENDED])
measured_count = sum(self._searchspace_metadata[_MEASURED])
excluded_count = sum(self._searchspace_metadata[_EXCLUDED])
n_elements = len(self._searchspace_metadata)
searchspace_fields = [
to_string(
"Recommended:",
f"{recommended_count}/{n_elements}",
single_line=True,
),
to_string(
"Measured:",
f"{measured_count}/{n_elements}",
single_line=True,
),
to_string(
"Excluded:",
f"{excluded_count}/{n_elements}",
single_line=True,
),
]
metadata_fields = [
to_string("Batches done", self.n_batches_done, single_line=True),
to_string("Fits done", self.n_fits_done, single_line=True),
to_string("Discrete Subspace Meta Data", *searchspace_fields),
]
metadata = to_string("Meta Data", *metadata_fields)
fields = [metadata, self.searchspace, self.objective, self.recommender]
@@ -196,13 +244,6 @@ def add_measurements(
f" the provided dataframe."
)

# Update meta data
# TODO: refactor responsibilities
if self.searchspace.type in (SearchSpaceType.DISCRETE, SearchSpaceType.HYBRID):
self.searchspace.discrete.mark_as_measured(
data, numerical_measurements_must_be_within_tolerance
)

# Read in measurements and add them to the database
self.n_batches_done += 1
to_insert = data.copy()
@@ -213,6 +254,16 @@
[self._measurements_exp, to_insert], axis=0, ignore_index=True
)

# Update metadata
if self.searchspace.type in (SearchSpaceType.DISCRETE, SearchSpaceType.HYBRID):
idxs_matched = fuzzy_row_match(
self.searchspace.discrete.exp_rep,
data,
self.parameters,
numerical_measurements_must_be_within_tolerance,
)
self._searchspace_metadata.loc[idxs_matched, _MEASURED] = True

# Telemetry
telemetry_record_value(TELEM_LABELS["COUNT_ADD_RESULTS"], 1)
telemetry_record_recommended_measurement_percentage(
@@ -222,6 +273,65 @@
numerical_measurements_must_be_within_tolerance,
)

def toggle_discrete_candidates( # noqa: DOC501
self,
constraints: Collection[DiscreteConstraint] | pd.DataFrame,
exclude: bool,
complement: bool = False,
dry_run: bool = False,
) -> pd.DataFrame:
"""In-/exclude certain discrete points in/from the candidate set.
Args:
constraints: A filtering mechanism determining the candidates subset to be
in-/excluded. Can be either a collection of
:class:`~baybe.constraints.base.DiscreteConstraint` or a dataframe.
For the latter, see :func:`~baybe.utils.dataframe.filter_df`
for details.
exclude: If ``True``, the specified candidates are excluded.
If ``False``, the candidates are considered for recommendation.
complement: If ``True``, the filtering mechanism is inverted so that
the complement of the candidate subset specified by the filter is
toggled. For details, see :func:`~baybe.utils.dataframe.filter_df`.
dry_run: If ``True``, the target subset is only extracted but not
affected. If ``False``, the candidate set is updated correspondingly.
Useful for setting up the correct filtering mechanism.
Returns:
A new dataframe containing the discrete candidate set passing through the
specified filter.
"""
df = self.searchspace.discrete.exp_rep

if isinstance(constraints, pd.DataFrame):
# Determine the candidate subset to be toggled
points = filter_df(df, constraints, complement)

elif isinstance(constraints, Collection) and is_all_instance(
constraints, DiscreteConstraint
):
# TODO: Should be taken over by upcoming `SubspaceDiscrete.filter` method,
# automatically choosing the appropriate backend (polars/pandas/...)

# Filter the search space dataframe according to the given constraint
idx = reduce(
lambda x, y: x.intersection(y), (c.get_valid(df) for c in constraints)
)

# Determine the candidate subset to be toggled
points = df.drop(index=idx) if complement else df.loc[idx].copy()

else:
raise TypeError(
"Candidate toggling is not implemented for the given type of "
"constraint specifications."
)

if not dry_run:
self._searchspace_metadata.loc[points.index, _EXCLUDED] = exclude

return points

def recommend(
self,
batch_size: int,
@@ -260,10 +370,18 @@ def recommend(
self.n_fits_done += 1
self._measurements_exp.fillna({"FitNr": self.n_fits_done}, inplace=True)

# Prepare the search space according to the current campaign state
annotated_searchspace = evolve(
self.searchspace,
discrete=AnnotatedSubspaceDiscrete.from_subspace(
self.searchspace.discrete, self._searchspace_metadata
),
)

# Get the recommended search space entries
rec = self.recommender.recommend(
batch_size,
self.searchspace,
annotated_searchspace,
self.objective,
self._measurements_exp,
pending_experiments,
Expand All @@ -272,6 +390,10 @@ def recommend(
# Cache the recommendations
self._cached_recommendation = rec.copy()

# Update metadata
if self.searchspace.type in (SearchSpaceType.DISCRETE, SearchSpaceType.HYBRID):
self._searchspace_metadata.loc[rec.index, _RECOMMENDED] = True

# Telemetry
telemetry_record_value(TELEM_LABELS["COUNT_RECOMMEND"], 1)
telemetry_record_value(TELEM_LABELS["BATCH_SIZE"], batch_size)
@@ -361,7 +483,7 @@ def _drop_version(dict_: dict) -> dict:
return dict_


# Register de-/serialization hooks
# Register (un-)structure hooks
unstructure_hook = cattrs.gen.make_dict_unstructure_fn(
Campaign, converter, _cattrs_include_init_false=True
)
18 changes: 15 additions & 3 deletions baybe/constraints/base.py
@@ -93,17 +93,29 @@ class DiscreteConstraint(Constraint, ABC):
eval_during_modeling: ClassVar[bool] = False
# See base class.

def get_valid(self, df: pd.DataFrame, /) -> pd.Index:
"""Get the indices of dataframe entries that are valid under the constraint.
Args:
df: A dataframe where each row represents a parameter configuration.
Returns:
The dataframe indices of rows that fulfill the constraint.
"""
invalid = self.get_invalid(df)
return df.index.drop(invalid)

@abstractmethod
def get_invalid(self, data: pd.DataFrame) -> pd.Index:
"""Get the indices of dataframe entries that are invalid under the constraint.
Args:
data: A dataframe where each row represents a particular parameter
combination.
data: A dataframe where each row represents a parameter configuration.
Returns:
The dataframe indices of rows where the constraint is violated.
The dataframe indices of rows that violate the constraint.
"""
# TODO: Should switch backends (pandas/polars/...) behind the scenes

def get_invalid_polars(self) -> pl.Expr:
"""Translate the constraint to Polars expression identifying undesired rows.
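
The `get_valid`/`get_invalid` pair above can be exercised with a toy constraint. This is a hedged sketch using a hypothetical stand-alone class — the real `DiscreteConstraint` base class carries additional machinery (serialization, polars hooks, etc.):

```python
import pandas as pd


class ToyThresholdConstraint:
    """Hypothetical minimal analogue of a discrete constraint."""

    def __init__(self, column: str, threshold: float) -> None:
        self.column = column
        self.threshold = threshold

    def get_invalid(self, data: pd.DataFrame) -> pd.Index:
        """Indices of rows violating the constraint."""
        return data.index[data[self.column] > self.threshold]

    def get_valid(self, df: pd.DataFrame) -> pd.Index:
        """Mirrors the new base-class method: all indices minus the invalid ones."""
        return df.index.drop(self.get_invalid(df))


df = pd.DataFrame({"Temperature": [10, 50, 90]})
constraint = ToyThresholdConstraint("Temperature", threshold=60)
valid = constraint.get_valid(df)  # indices 0 and 1 remain
```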
2 changes: 1 addition & 1 deletion baybe/kernels/base.py
@@ -125,7 +125,7 @@ class CompositeKernel(Kernel, ABC):
"""Abstract base class for all composite kernels."""


# Register de-/serialization hooks
# Register (un-)structure hooks
converter.register_structure_hook(Kernel, get_base_structure_hook(Kernel))
converter.register_unstructure_hook(Kernel, unstructure_base)

2 changes: 1 addition & 1 deletion baybe/objectives/base.py
@@ -60,7 +60,7 @@ def to_objective(x: Target | Objective, /) -> Objective:
return x if isinstance(x, Objective) else x.to_objective()


# Register de-/serialization hooks
# Register (un-)structure hooks
converter.register_structure_hook(Objective, structure_objective)
converter.register_unstructure_hook(
Objective,