chore: release notes 0.38.0 #10231

Merged 4 commits on Nov 22, 2024
175 changes: 175 additions & 0 deletions docs/release-notes.rst
@@ -6,6 +6,181 @@
Release Notes
###############

**************
Version 0.38
**************

Version 0.38.0
==============

**Release Date:** November 22, 2024

**Breaking Changes**

- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and
``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code
must report the configured ``time_metric`` in validation metrics. As a convenience, Determined
training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can
use as your ``time_metric``. ASHA experiments without this modification will no longer run.

- Custom Searchers: All custom searchers, including DeepSpeed Autotune, were deprecated in
``0.36.0`` and are now removed. Determined continues to provide first-class support for a variety
of preset searchers, which can be :ref:`configured <experiment-configuration_searcher>` for any
experiment. Visit :ref:`search-methods` for details.
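
As a rough illustration of the ASHA change above, the sketch below writes a searcher section as a
Python dictionary and checks it for the new fields. The ``check_asha_searcher`` helper, the set of
searcher names it inspects, the metric name, and all numeric values are assumptions made for this
example and are not part of Determined; only ``max_time``, ``time_metric``, and ``max_length``
come from the note above.

```python
# Hypothetical pre-submission check; not part of the Determined API.
# The searcher names below are assumed to be the ASHA-based presets.
ASHA_SEARCHERS = {"adaptive_asha", "async_halving"}


def check_asha_searcher(searcher: dict) -> list:
    """Return a list of problems found in an ASHA searcher config section."""
    problems = []
    if searcher.get("name") not in ASHA_SEARCHERS:
        return problems  # Only ASHA searchers are affected by this change.
    for field in ("max_time", "time_metric"):
        if field not in searcher:
            problems.append(f"missing required field: {field}")
    if "max_length" in searcher:
        problems.append("max_length is no longer used by ASHA; use max_time")
    return problems


# Example values are assumptions chosen for illustration only.
searcher = {
    "name": "adaptive_asha",
    "metric": "validation_loss",
    "smaller_is_better": True,
    "max_trials": 16,
    "time_metric": "batches",  # one of the metrics reported automatically
    "max_time": 1000,
}

print(check_asha_searcher(searcher))
```

A check like this could run before submitting an experiment, so that a config still using
``max_length`` is caught locally rather than failing at launch.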

**New Features**

- API/CLI: Add support for access tokens, which let users authenticate in automated workflows.
Users can create and administer access tokens and define the lifespan of each token. Global
defaults and limits for token validity can be set by configuring ``default_lifespan_days`` and
``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1``
indicates an **infinite** lifespan for the access token. Controlling token usage and expiration
in this way supports automation without weakening security. This feature requires Determined
Enterprise Edition.

- CLI:

- ``det token create``: Create a new access token.
- ``det token login``: Sign in with an access token.
- ``det token edit``: Update an access token's description.
- ``det token list``: List all active access tokens, with options for displaying revoked
tokens.
- ``det token describe``: Show details of specific access tokens.
- ``det token revoke``: Revoke an access token.

- API:

- ``POST /api/v1/tokens``: Create a new access token.
- ``GET /api/v1/tokens``: Retrieve a list of access tokens.
- ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token.

- API: Introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that
integrates Keras training code with Determined through a single :ref:`Keras Callback
<api-keras-ug>`.

- API: Introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
allows for Python-side training loop configurations and includes support for local training.

- Cluster: In the enterprise edition of Determined, add :ref:`config policies <config-policies>` to
enable administrators to set limits on how users can define workloads (e.g., experiments,
notebooks, TensorBoards, shells, and commands). Administrators can define two types of
configurations:

- **Invariant Configs for Experiments**: Settings applied to all experiments within a specific
scope (global or workspace). Invariant configs for other tasks (e.g., notebooks, TensorBoards,
shells, and commands) are not yet supported.

- **Constraints**: Restrictions that prevent users from exceeding resource limits within a
scope. Constraints can be set independently for experiments and tasks.

- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and
``determined_master_scheme``. These control how tasks address the Determined API server and are
useful when installations span multiple Kubernetes clusters or there are proxies in between tasks
and the master. Also, ``determined_master_host`` now defaults to the service host,
``<det_namespace>.<det_service_name>.svc.cluster.local``, instead of the service IP.

- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit
:ref:`helm-config-reference` for more details.

- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which
allows users to create, view, and revoke their own :ref:`access tokens <access-tokens>`. This
role can only be assigned globally.

- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows
as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in
both the run table and run detail views.

The ``log_policies`` format has also changed: ``name`` is now required, and ``action`` is now a
plain string. For more details, refer to :ref:`log_policies <config-log-policies>`.
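
To make the ``log_policies`` shape described above more concrete, the following sketch represents
two policies as Python dictionaries and collects the names of the policies whose pattern matches
a log line, which is roughly the information the WebUI surfaces as labels. The ``pattern`` field,
the action values, the policy names, and the ``matching_labels`` helper are assumptions for
illustration only; the log_policies reference remains the authoritative schema.

```python
import re

# Illustrative policies; names, patterns, and actions are assumptions.
log_policies = [
    {"name": "CUDA OOM", "pattern": r".*CUDA out of memory.*", "action": "cancel_retries"},
    {"name": "ECC Error", "pattern": r".*uncorrectable ECC error.*", "action": "exclude_node"},
]


def matching_labels(policies, log_line):
    """Return the names of all policies whose pattern matches the log line."""
    labels = []
    for policy in policies:
        if "name" not in policy:
            raise ValueError("each log policy requires a name")
        # If an action is given, the new format expects a plain string.
        if "action" in policy and not isinstance(policy["action"], str):
            raise ValueError("action must be a plain string")
        if re.match(policy["pattern"], log_line):
            labels.append(policy["name"])
    return labels


print(matching_labels(log_policies, "RuntimeError: CUDA out of memory. Tried to allocate 2 GiB"))
```

The helper only mimics the labeling idea on the client side; the actual matching is performed by
Determined when it processes task logs.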

**Improvements**

- Master Configuration: Add support for configuring the cryptosystem used for SSH connections.
``security.key_type`` now accepts ``RSA``, ``ECDSA``, or ``ED25519``. The default key type has
changed from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure
than the old default, and ``ED25519`` is also the default key type for ``ssh-keygen``.

**Removed Features**

- WebUI: "Continue Training" no longer supports a configurable number of batches and now resumes
the trial from its last checkpoint.

**Known Issues**

- PyTorch has `deprecated
<https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html#use-tensorboard-to-view-results-and-analyze-model-performance>`_
its Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with
PyTorch 2.0 and above. The current default environment image comes with PyTorch 2.3. If you
experience issues with this plugin, we suggest using an image with a PyTorch version earlier
than 2.0.

**Bug Fixes**

- Grid search: Fix an issue where a hyperparameter containing an empty nested hyperparameter (that
is, just an empty map) did not appear in the hparams passed to the trial.

**Deprecations**

- Experiment Config: The ``max_length`` field of the searcher configuration section has been
deprecated for all experiments and searchers. Users are expected to configure the desired
training length directly in training code.

- Experiment Config: The ``optimizations`` config has been deprecated. Please see :ref:`Training
APIs <apis-howto-overview>` to configure supported optimizations through training code directly.

- Experiment Config: The ``scheduling_unit``, ``min_checkpoint_period``, and
``min_validation_period`` config fields have been deprecated. Instead, these configuration
options should be specified in training code.

- Experiment Config: The ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial
definitions. Please invoke your training script directly (``python3 train.py``).

- Core API: The ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no
longer requires ``core.searcher.operations`` to run, and progress should be reported through
``core.train.report_progress``.

- DeepSpeed: The ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
on ``DeepSpeedContext`` have been replaced with ``get_num_micro_batches_per_slot()`` and
``get_train_micro_batch_size_per_gpu()``, respectively.

- Horovod: The Horovod distributed training backend has been deprecated. Users are encouraged to
migrate to the native distributed backend of their training framework (``torch.distributed`` or
``tf.distribute``).

- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new
:ref:`Keras Callback <api-keras-ug>`.

- Launchers: The ``--trial`` argument in Determined launchers has been deprecated. Please invoke
your training script directly.

- ASHA: The ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated.
All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based.

- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All
training APIs now support local execution (``python3 train.py``). Please see :ref:`Training APIs
<apis-howto-overview>` for details specific to your framework.

- Web UI: Previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web
UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback
option.

- Database: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det
deploy aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which
uses Amazon RDS for PostgreSQL. We recommend that users migrate to Amazon RDS for PostgreSQL. For
more information, visit the `migration instructions
<https://gist.github.com/maxrussell/c67f4f7d586d55c4eb2658cc2dd1c290>`_.

- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November
14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later
to maintain compatibility. The application will log a warning if it detects a connection to any
PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once
it is End of Life.
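
For the Core API deprecation listed above, a training loop that previously iterated over
``core.searcher.operations`` might look roughly like the sketch below. It assumes that
``core_context`` is the object obtained from ``det.core.init()`` and that ``report_progress``
accepts a fraction between 0 and 1; the ``train`` function, the placeholder loss, and the epoch
count are hypothetical.

```python
def train(core_context, max_epochs):
    """Run a fixed number of epochs and report progress after each one."""
    for epoch in range(max_epochs):
        # Placeholder for real training and validation work.
        val_loss = 1.0 / (epoch + 1)

        core_context.train.report_validation_metrics(
            steps_completed=epoch + 1,
            metrics={"val_loss": val_loss},
        )
        # Progress is reported as the fraction of planned work completed
        # (assumed here to be a float in [0, 1]).
        core_context.train.report_progress((epoch + 1) / max_epochs)
```

In a real script the context would come from Determined, for example ``with det.core.init() as
core_context: train(core_context, max_epochs=10)``. The training length now lives in the training
code itself, which is consistent with the ``max_length`` deprecation above.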

**************
Version 0.37
**************
7 changes: 0 additions & 7 deletions docs/release-notes/9966-fix-grid.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/release-notes/add-host-port-scheme-to-helm.rst

This file was deleted.

28 changes: 0 additions & 28 deletions docs/release-notes/api-cli-access-token.rst

This file was deleted.

15 changes: 0 additions & 15 deletions docs/release-notes/config-policies.rst

This file was deleted.

6 changes: 0 additions & 6 deletions docs/release-notes/helm-db-snapshot.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/release-notes/log-signal.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/release-notes/pytorch-tensorboard-plugin.rst

This file was deleted.

7 changes: 0 additions & 7 deletions docs/release-notes/rbac-new-tokenCreator-role.rst

This file was deleted.

7 changes: 0 additions & 7 deletions docs/release-notes/remove-custom-searcher.rst

This file was deleted.

72 changes: 0 additions & 72 deletions docs/release-notes/searcher-context-removal.rst

This file was deleted.

8 changes: 0 additions & 8 deletions docs/release-notes/ssh-crypto-system.rst

This file was deleted.

19 changes: 0 additions & 19 deletions docs/release-notes/unsupport-aurora-postgres-reminder.rst

This file was deleted.
