Immutable searchspace (#412)
Fixes #371 by making `SubspaceDiscrete` stateless.

### Current Approach
* The `SubspaceDiscrete.metadata` attribute gets deprecated and the
responsibility of metadata handling is shifted to `Campaign`.
* The new mechanism is not yet final (see "Out of scope" below) but is
designed so that upcoming changes can be implemented in a non-breaking
manner. In particular:
* The metadata handling mechanism is redesigned such that the actual
metadata representation is completely hidden from the user, i.e. the
campaign manages the data in the form of private attributes. This
avoids further lock-in to `pandas` as our search space backend and
prepares for future search space improvements by abstracting away the
specific implementation details, enabling us to easily offer other
backends (polars, databases, etc.) in the future.
* The `allow_*` flags are not yet migrated to the `Campaign` class, but
the `AnnotatedSubspaceDiscrete` makes it possible to migrate them in a
follow-up PR (#423) without causing much friction.
* A new user-facing method `Campaign.toggle_discrete_candidates` now
allows convenient and dynamic control over the discrete candidate set,
avoiding any fiddling with the backend dataframes and index
manipulations. The positive effect can be seen in the much cleaner code
of the simulation package.
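
To make the new control flow concrete, here is a minimal pandas sketch of the dataframe-based toggling path — the `exp_rep`/`metadata` frames and the `toggle` helper are hypothetical stand-ins for the campaign's internals, not the actual `Campaign` implementation:

```python
import pandas as pd

# Hypothetical stand-ins for the campaign's internals: the discrete
# experimental representation and the metadata frame tracking its status.
exp_rep = pd.DataFrame(
    {"Temperature": [10, 10, 20, 20], "Solvent": ["A", "B", "A", "B"]}
)
metadata = pd.DataFrame(
    False, index=exp_rep.index, columns=["recommended", "measured", "excluded"]
)


def toggle(filter_df: pd.DataFrame, exclude: bool, complement: bool = False,
           dry_run: bool = False) -> pd.DataFrame:
    """Sketch of dataframe-based candidate toggling."""
    # Rows of `exp_rep` matching any row of the filter specification
    mask = exp_rep[list(filter_df.columns)].apply(tuple, axis=1).isin(
        filter_df.apply(tuple, axis=1)
    )
    if complement:
        mask = ~mask
    points = exp_rep[mask]
    if not dry_run:  # a dry run only extracts the subset
        metadata.loc[points.index, "excluded"] = exclude
    return points


# Exclude all configurations that use solvent "B"
toggled = toggle(pd.DataFrame({"Solvent": ["B"]}), exclude=True)
```

The actual method additionally accepts a collection of `DiscreteConstraint` objects and intersects their valid indices, as the diff shows.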

### Out of scope / (potentially) coming next
* Migration of the `allow_*` flags, making `Campaign` the unique place
where the concept of metadata exists, i.e. campaigns will be the only
stateful objects. A PR taking care of this should follow soon because
the current `get_candidates` signature of `SubspaceDiscrete` makes
little sense: it expects these flags in a context where metadata does
not exist.
* Once the flags are migrated, `AnnotatedSubspaceDiscrete` might become
obsolete, since the `Campaign` class could then filter down the space
itself before passing it to the recommender. This, however, requires an
efficient implementation that does not cause unnecessary dataframe
copies.
* Actually making the search space classes `frozen`. There are a few
other things that should be addressed at the same time (e.g. a general
cleanup of the classes).
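
On the efficiency point, one possible shape of copy-free filtering is to carry only a boolean mask (derived from a metadata column such as the `excluded` flag introduced in this PR) and materialize the filtered frame once, at the recommender boundary. This is an illustrative sketch, not the planned implementation:

```python
import pandas as pd

exp_rep = pd.DataFrame({"x": range(6)})
metadata = pd.DataFrame(
    False, index=exp_rep.index, columns=["recommended", "measured", "excluded"]
)
metadata.loc[[0, 3], "excluded"] = True

# Carry the cheap boolean mask through the call chain ...
mask = ~metadata["excluded"]

# ... and create the single filtered copy only when handing the
# candidates to the recommender.
candidates = exp_rep.loc[mask]
```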
AdrianSosic authored Nov 22, 2024
2 parents ca8b221 + a0bbc78 commit 98cb4ea
Showing 28 changed files with 433 additions and 258 deletions.
12 changes: 9 additions & 3 deletions CHANGELOG.md
@@ -10,6 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Example for a traditional mixture
- `add_noise_to_perturb_degenerate_rows` utility
- `benchmarks` subpackage for defining and running performance tests
- `Campaign.toggle_discrete_candidates` to dynamically in-/exclude discrete candidates
- `DiscreteConstraint.get_valid` to conveniently access valid candidates

### Changed
- `SubstanceParameter` encodings are now computed exclusively with the
@@ -25,6 +27,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Rare bug arising from degenerate `SubstanceParameter.comp_df` rows that caused
wrong number of recommendations being returned
- `ContinuousConstraint`s can now be used in single point precision
- Search spaces are now stateless, preventing unintended side effects that could lead to
incorrect candidate sets when reused in different optimization contexts

### Deprecations
- Passing a dataframe via the `data` argument to `Objective.transform` is no longer
@@ -37,6 +41,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `SubstanceEncoding` value `MORGAN_FP`. As a replacement, `ECFP` with 1024 bits and
radius of 4 can be used.
- `SubstanceEncoding` value `RDKIT`. As a replacement, `RDKIT2DDESCRIPTORS` can be used.
- The `metadata` attribute of `SubspaceDiscrete` no longer exists. Metadata is now
exclusively handled by the `Campaign` class.

## [0.11.3] - 2024-11-06
### Fixed
@@ -301,7 +307,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Simulation no longer fails for targets in `MATCH` mode
- `closest_element` now works for array-like input of all kinds
- Structuring concrete subclasses no longer requires providing an explicit `type` field
- `_target(s)` attributes of `Objectives` are now de-/serialized without leading
- `_target(s)` attributes of `Objectives` are now (de-)serialized without leading
underscore to support user-friendly serialization strings
- Telemetry does not execute any code if it was disabled
- Running simulations no longer alters the states of the global random number generators
@@ -439,7 +445,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `mypy` for targets and intervals
- Tests for code blocks in README and user guides
- `hypothesis` strategies and roundtrip tests for targets, intervals, and dataframes
- De-/serialization of target subclasses via base class
- (De-)serialization of target subclasses via base class
- Docs building check now part of CI
- Automatic formatting checks for code examples in documentation
- Deserialization of classes with classmethod constructors can now be customized
@@ -464,7 +470,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Use pydoclint as flake8 plugin and not as a stand-alone linter
- Margins in documentation for desktop and mobile version
- `Interval`s can now also be deserialized from a bounds iterable
- `SubspaceDiscrete` and `SubspaceContinuous` now have de-/serialization methods
- `SubspaceDiscrete` and `SubspaceContinuous` now have (de-)serialization methods

### Removed
- Conda install instructions and version badge
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ Besides functionality to perform a typical recommend-measure loop, BayBE's highl
- ⚙️ Custom surrogate models: Enhance your predictions through mechanistic understanding
- 📈 Comprehensive backtest, simulation and imputation utilities: Benchmark and find your best settings
- 📝 Fully typed and hypothesis-tested: Robust code base
- 🔄 All objects are fully de-/serializable: Useful for storing results in databases or use in wrappers like APIs
- 🔄 All objects are fully (de-)serializable: Useful for storing results in databases or use in wrappers like APIs


## ⚡ Quick Start
2 changes: 1 addition & 1 deletion baybe/acquisition/base.py
@@ -159,7 +159,7 @@ def _get_botorch_acqf_class(
)


# Register de-/serialization hooks
# Register (un-)structure hooks
def _add_deprecation_hook(hook):
"""Add deprecation warnings to the default hook.
142 changes: 132 additions & 10 deletions baybe/campaign.py
@@ -4,23 +4,27 @@

import gc
import json
from collections.abc import Collection
from functools import reduce
from typing import TYPE_CHECKING

import cattrs
import numpy as np
import pandas as pd
from attrs import define, field
from attrs import define, evolve, field
from attrs.converters import optional
from attrs.validators import instance_of
from typing_extensions import override

from baybe.constraints.base import DiscreteConstraint
from baybe.exceptions import IncompatibilityError
from baybe.objectives.base import Objective, to_objective
from baybe.parameters.base import Parameter
from baybe.recommenders.base import RecommenderProtocol
from baybe.recommenders.meta.base import MetaRecommender
from baybe.recommenders.meta.sequential import TwoPhaseMetaRecommender
from baybe.recommenders.pure.bayesian.base import BayesianRecommender
from baybe.searchspace._annotated import AnnotatedSubspaceDiscrete
from baybe.searchspace.core import (
SearchSpace,
SearchSpaceType,
@@ -35,12 +39,20 @@
telemetry_record_recommended_measurement_percentage,
telemetry_record_value,
)
from baybe.utils.basic import is_all_instance
from baybe.utils.boolean import eq_dataframe
from baybe.utils.dataframe import filter_df, fuzzy_row_match
from baybe.utils.plotting import to_string

if TYPE_CHECKING:
from botorch.posteriors import Posterior

# Metadata columns
_RECOMMENDED = "recommended"
_MEASURED = "measured"
_EXCLUDED = "excluded"
_METADATA_COLUMNS = [_RECOMMENDED, _MEASURED, _EXCLUDED]


@define
class Campaign(SerialMixin):
@@ -77,6 +89,9 @@ class Campaign(SerialMixin):
"""The employed recommender"""

# Metadata
_searchspace_metadata: pd.DataFrame = field(init=False, eq=eq_dataframe)
"""Metadata tracking the experimentation status of the search space."""

n_batches_done: int = field(default=0, init=False)
"""The number of already processed batches."""

@@ -94,11 +109,44 @@
)
"""The cached recommendations."""

@_searchspace_metadata.default
def _default_searchspace_metadata(self) -> pd.DataFrame:
"""Create a fresh metadata object."""
df = pd.DataFrame(
False,
index=self.searchspace.discrete.exp_rep.index,
columns=_METADATA_COLUMNS,
)
df.loc[:, _EXCLUDED] = self.searchspace.discrete._excluded
return df

@override
def __str__(self) -> str:
recommended_count = sum(self._searchspace_metadata[_RECOMMENDED])
measured_count = sum(self._searchspace_metadata[_MEASURED])
excluded_count = sum(self._searchspace_metadata[_EXCLUDED])
n_elements = len(self._searchspace_metadata)
searchspace_fields = [
to_string(
"Recommended:",
f"{recommended_count}/{n_elements}",
single_line=True,
),
to_string(
"Measured:",
f"{measured_count}/{n_elements}",
single_line=True,
),
to_string(
"Excluded:",
f"{excluded_count}/{n_elements}",
single_line=True,
),
]
metadata_fields = [
to_string("Batches done", self.n_batches_done, single_line=True),
to_string("Fits done", self.n_fits_done, single_line=True),
to_string("Discrete Subspace Meta Data", *searchspace_fields),
]
metadata = to_string("Meta Data", *metadata_fields)
fields = [metadata, self.searchspace, self.objective, self.recommender]
@@ -196,13 +244,6 @@ def add_measurements(
f" the provided dataframe."
)

# Update meta data
# TODO: refactor responsibilities
if self.searchspace.type in (SearchSpaceType.DISCRETE, SearchSpaceType.HYBRID):
self.searchspace.discrete.mark_as_measured(
data, numerical_measurements_must_be_within_tolerance
)

# Read in measurements and add them to the database
self.n_batches_done += 1
to_insert = data.copy()
@@ -213,6 +254,16 @@
[self._measurements_exp, to_insert], axis=0, ignore_index=True
)

# Update metadata
if self.searchspace.type in (SearchSpaceType.DISCRETE, SearchSpaceType.HYBRID):
idxs_matched = fuzzy_row_match(
self.searchspace.discrete.exp_rep,
data,
self.parameters,
numerical_measurements_must_be_within_tolerance,
)
self._searchspace_metadata.loc[idxs_matched, _MEASURED] = True

# Telemetry
telemetry_record_value(TELEM_LABELS["COUNT_ADD_RESULTS"], 1)
telemetry_record_recommended_measurement_percentage(
@@ -222,6 +273,65 @@
numerical_measurements_must_be_within_tolerance,
)

def toggle_discrete_candidates( # noqa: DOC501
self,
constraints: Collection[DiscreteConstraint] | pd.DataFrame,
exclude: bool,
complement: bool = False,
dry_run: bool = False,
) -> pd.DataFrame:
"""In-/exclude certain discrete points in/from the candidate set.
Args:
constraints: A filtering mechanism determining the candidates subset to be
in-/excluded. Can be either a collection of
:class:`~baybe.constraints.base.DiscreteConstraint` or a dataframe.
For the latter, see :func:`~baybe.utils.dataframe.filter_df`
for details.
exclude: If ``True``, the specified candidates are excluded.
If ``False``, the candidates are considered for recommendation.
complement: If ``True``, the filtering mechanism is inverted so that
the complement of the candidate subset specified by the filter is
toggled. For details, see :func:`~baybe.utils.dataframe.filter_df`.
dry_run: If ``True``, the target subset is only extracted but not
affected. If ``False``, the candidate set is updated correspondingly.
Useful for setting up the correct filtering mechanism.
Returns:
A new dataframe containing the discrete candidate set passing through the
specified filter.
"""
df = self.searchspace.discrete.exp_rep

if isinstance(constraints, pd.DataFrame):
# Determine the candidate subset to be toggled
points = filter_df(df, constraints, complement)

elif isinstance(constraints, Collection) and is_all_instance(
constraints, DiscreteConstraint
):
# TODO: Should be taken over by upcoming `SubspaceDiscrete.filter` method,
# automatically choosing the appropriate backend (polars/pandas/...)

# Filter the search space dataframe according to the given constraint
idx = reduce(
lambda x, y: x.intersection(y), (c.get_valid(df) for c in constraints)
)

# Determine the candidate subset to be toggled
points = df.drop(index=idx) if complement else df.loc[idx].copy()

else:
raise TypeError(
"Candidate toggling is not implemented for the given type of "
"constraint specifications."
)

if not dry_run:
self._searchspace_metadata.loc[points.index, _EXCLUDED] = exclude

return points

def recommend(
self,
batch_size: int,
@@ -260,10 +370,18 @@ def recommend(
self.n_fits_done += 1
self._measurements_exp.fillna({"FitNr": self.n_fits_done}, inplace=True)

# Prepare the search space according to the current campaign state
annotated_searchspace = evolve(
self.searchspace,
discrete=AnnotatedSubspaceDiscrete.from_subspace(
self.searchspace.discrete, self._searchspace_metadata
),
)

# Get the recommended search space entries
rec = self.recommender.recommend(
batch_size,
self.searchspace,
annotated_searchspace,
self.objective,
self._measurements_exp,
pending_experiments,
Expand All @@ -272,6 +390,10 @@ def recommend(
# Cache the recommendations
self._cached_recommendation = rec.copy()

# Update metadata
if self.searchspace.type in (SearchSpaceType.DISCRETE, SearchSpaceType.HYBRID):
self._searchspace_metadata.loc[rec.index, _RECOMMENDED] = True

# Telemetry
telemetry_record_value(TELEM_LABELS["COUNT_RECOMMEND"], 1)
telemetry_record_value(TELEM_LABELS["BATCH_SIZE"], batch_size)
@@ -361,7 +483,7 @@ def _drop_version(dict_: dict) -> dict:
return dict_


# Register de-/serialization hooks
# Register (un-)structure hooks
unstructure_hook = cattrs.gen.make_dict_unstructure_fn(
Campaign, converter, _cattrs_include_init_false=True
)
18 changes: 15 additions & 3 deletions baybe/constraints/base.py
@@ -93,17 +93,29 @@ class DiscreteConstraint(Constraint, ABC):
eval_during_modeling: ClassVar[bool] = False
# See base class.

def get_valid(self, df: pd.DataFrame, /) -> pd.Index:
"""Get the indices of dataframe entries that are valid under the constraint.
Args:
df: A dataframe where each row represents a parameter configuration.
Returns:
The dataframe indices of rows that fulfill the constraint.
"""
invalid = self.get_invalid(df)
return df.index.drop(invalid)

@abstractmethod
def get_invalid(self, data: pd.DataFrame) -> pd.Index:
"""Get the indices of dataframe entries that are invalid under the constraint.
Args:
data: A dataframe where each row represents a particular parameter
combination.
data: A dataframe where each row represents a parameter configuration.
Returns:
The dataframe indices of rows where the constraint is violated.
The dataframe indices of rows that violate the constraint.
"""
# TODO: Should switch backends (pandas/polars/...) behind the scenes

def get_invalid_polars(self) -> pl.Expr:
"""Translate the constraint to Polars expression identifying undesired rows.
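
The `get_valid`/`get_invalid` pair above can be exercised with a toy constraint. This is a hedged sketch using a hypothetical stand-alone class — the real `DiscreteConstraint` base class carries additional machinery (serialization, polars hooks, etc.):

```python
import pandas as pd


class ToyThresholdConstraint:
    """Hypothetical minimal analogue of a discrete constraint."""

    def __init__(self, column: str, threshold: float) -> None:
        self.column = column
        self.threshold = threshold

    def get_invalid(self, data: pd.DataFrame) -> pd.Index:
        """Indices of rows violating the constraint."""
        return data.index[data[self.column] > self.threshold]

    def get_valid(self, df: pd.DataFrame) -> pd.Index:
        """Mirrors the new base-class method: all indices minus the invalid ones."""
        return df.index.drop(self.get_invalid(df))


df = pd.DataFrame({"Temperature": [10, 50, 90]})
constraint = ToyThresholdConstraint("Temperature", threshold=60)
valid = constraint.get_valid(df)  # indices 0 and 1 remain
```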
2 changes: 1 addition & 1 deletion baybe/kernels/base.py
@@ -125,7 +125,7 @@ class CompositeKernel(Kernel, ABC):
"""Abstract base class for all composite kernels."""


# Register de-/serialization hooks
# Register (un-)structure hooks
converter.register_structure_hook(Kernel, get_base_structure_hook(Kernel))
converter.register_unstructure_hook(Kernel, unstructure_base)

2 changes: 1 addition & 1 deletion baybe/objectives/base.py
@@ -60,7 +60,7 @@ def to_objective(x: Target | Objective, /) -> Objective:
return x if isinstance(x, Objective) else x.to_objective()


# Register de-/serialization hooks
# Register (un-)structure hooks
converter.register_structure_hook(Objective, structure_objective)
converter.register_unstructure_hook(
Objective,