Online Isolation Forest [1]
The anomaly detection literature abounds with offline methods, which require repeated access to data in memory and impose impractical assumptions when applied to a streaming context. Online Isolation Forest is an anomaly detection algorithm explicitly designed for streaming conditions, and it tracks the data-generating process as it evolves over time.
In the image above we illustrate the online learning capabilities of Online Isolation Forest with a toy example. Genuine data, depicted in green, are more densely distributed than anomalous data, represented in red. Online Isolation Forest processes points one at a time (i.e., in a streaming fashion), and assigns an anomaly score to each of them. As the stream continues, Online Isolation Forest acquires more information about the data distribution and refines the estimate of the anomaly scores accordingly.
Online Isolation Forest is a forest of d-dimensional multi-resolution histograms constructed by recursively
splitting the input space
When a new sample
The learning procedure is repeated until the window
In contrast to the learning procedure, which involves generating new nodes and thereby enhancing the resolution of Online Isolation Tree in that area, in the forgetting procedure we eventually aggregate nodes and merge the associated bins, ultimately reducing the histogram resolution in the corresponding region of the space. The image below illustrates the forgetting procedure, detailed in [1].
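The learning and forgetting procedures described above can be illustrated with a small, self-contained toy. This is a simplified sketch, not the implementation from [1] or from this repository. All names (`Node`, `ToyOnlineTree`, `ToyOnlineForest`, `eta`) are hypothetical. The sketch assumes a fixed, known input box, a split threshold of `eta * 2**depth`, uniform redistribution of counts when a bin splits, and a depth normalizer of `log2(window_size / eta)`. Refer to [1] for the actual rules.

```python
import math
import random
from collections import deque


class Node:
    """One histogram bin: an axis-aligned box plus a point count."""

    def __init__(self, lo, hi, depth, count=0):
        self.lo, self.hi, self.depth, self.count = list(lo), list(hi), depth, count
        self.dim = self.value = self.left = self.right = None  # leaf while dim is None


class ToyOnlineTree:
    def __init__(self, lo, hi, eta, rng):
        self.root, self.eta, self.rng = Node(lo, hi, 0), eta, rng

    def _threshold(self, depth):
        return self.eta * 2 ** depth  # assumed rule: deeper bins need more points

    def _child(self, node, x):
        return node.left if x[node.dim] < node.value else node.right

    def learn(self, x):
        node = self.root
        node.count += 1
        while node.dim is not None:  # descend to the leaf bin containing x
            node = self._child(node, x)
            node.count += 1
        if node.count >= self._threshold(node.depth):
            self._split(node)  # refine the histogram where data accumulates

    def _split(self, node):
        dims = [d for d in range(len(node.lo)) if node.hi[d] > node.lo[d]]
        if not dims:
            return
        dim = self.rng.choice(dims)
        a, b = node.lo[dim], node.hi[dim]
        value = self.rng.uniform(a, b)
        # Points are not stored, so the count is shared out as if the bin
        # were uniformly filled (a simplification of this sketch).
        n_left = sum(self.rng.uniform(a, b) < value for _ in range(node.count))
        left_hi, right_lo = list(node.hi), list(node.lo)
        left_hi[dim] = right_lo[dim] = value
        node.left = Node(node.lo, left_hi, node.depth + 1, n_left)
        node.right = Node(right_lo, node.hi, node.depth + 1, node.count - n_left)
        node.dim, node.value = dim, value

    def forget(self, x):
        node = self.root
        while True:
            node.count = max(0, node.count - 1)  # clamp: counts are approximate
            if node.dim is None:
                return
            if node.count < self._threshold(node.depth):
                # Too few points left: merge the children back into one bin.
                node.dim = node.value = node.left = node.right = None
                return
            node = self._child(node, x)

    def depth(self, x):
        node = self.root
        while node.dim is not None:
            node = self._child(node, x)
        return node.depth


class ToyOnlineForest:
    def __init__(self, lo, hi, num_trees=30, window_size=256, eta=4, seed=0):
        rng = random.Random(seed)
        self.trees = [ToyOnlineTree(lo, hi, eta, rng) for _ in range(num_trees)]
        self.window, self.window_size = deque(), window_size
        self.norm = math.log2(window_size / eta)  # assumed depth normalizer

    def score(self, x):
        mean_depth = sum(t.depth(x) for t in self.trees) / len(self.trees)
        return 2.0 ** (-mean_depth / self.norm)  # shallow bin -> score near 1

    def learn(self, x):
        self.window.append(x)
        for tree in self.trees:
            tree.learn(x)
        if len(self.window) > self.window_size:  # window full: forget the oldest
            old = self.window.popleft()
            for tree in self.trees:
                tree.forget(old)


data_rng = random.Random(1)
forest = ToyOnlineForest(lo=[0.0, 0.0], hi=[1.0, 1.0])
for _ in range(1000):
    point = (data_rng.uniform(0.4, 0.6), data_rng.uniform(0.4, 0.6))
    forest.score(point)  # score first, then learn
    forest.learn(point)
print(forest.score((0.5, 0.5)), forest.score((0.95, 0.05)))
```

In this toy, a point in the dense region ends up in a deep, fine-grained bin and receives a low score. A point far from the data stays in a shallow, coarse bin and receives a higher score. Forgetting the oldest point in the window lowers the counts along its path and merges bins whose count falls below the threshold.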
In this repository you can find the datasets and scripts used to benchmark Online Isolation Forest and competing methods in [1].
This repository also includes the scripts used to plot results, as well as the results and figures used in [1].
Please note that due to the large size of some results, we have used git-LFS. You will need to install git-LFS to correctly clone the repository.
Please note that, to run the benchmarking scripts correctly, you must adjust the pysad implementation of Isolation Forest-ASD so that it accepts a number of tree estimators other than the default.
Specifically, you need to modify the following lines in the file python3.10/site-packages/pysad/models/iforest_asd.py
from:
    def __init__(self, initial_window_X=None, window_size=2048):
        super().__init__(IForest, window_size, window_size, initial_window_X)
to:
    def __init__(self, initial_window_X=None, window_size=2048, **kwargs):
        super().__init__(IForest, window_size, window_size, initial_window_X, **kwargs)
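The effect of this change can be illustrated without pysad installed. The sketch below uses stand-in classes (`StubForest`, `WindowedModel`, and `PatchedIForestASD` are hypothetical and are not pysad's classes). It assumes that the base class passes extra keyword arguments on to the wrapped model's constructor, and that the number of trees is controlled by a parameter named `n_estimators`. Both assumptions are worth checking against your installed versions.

```python
class StubForest:
    """Hypothetical stand-in for the wrapped Isolation Forest model."""

    def __init__(self, n_estimators=100):
        self.n_estimators = n_estimators


class WindowedModel:
    """Hypothetical stand-in for the windowed base class."""

    def __init__(self, model_cls, window_size, sliding_size,
                 initial_window_X=None, **kwargs):
        self.window_size = window_size
        self.sliding_size = sliding_size
        # Assumption: extra keyword arguments reach the wrapped model.
        self.model = model_cls(**kwargs)


class PatchedIForestASD(WindowedModel):
    """Mirrors the patched signature: **kwargs is accepted and forwarded."""

    def __init__(self, initial_window_X=None, window_size=2048, **kwargs):
        super().__init__(StubForest, window_size, window_size,
                         initial_window_X, **kwargs)


detector = PatchedIForestASD(window_size=256, n_estimators=32)
print(detector.model.n_estimators)  # prints 32
```

With the original signature, which has no `**kwargs`, the same call would raise a `TypeError` because `n_estimators` is an unexpected keyword argument.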
This folder contains a Python implementation of the Online Isolation Forest algorithm.
Additionally, you can find a demo of the algorithm in this file.
In order to play with the demo you just need to:
- Clone the repo locally.
- Install dependencies listed in the requirements file.
- Run Online-iForest_demo.
Online Isolation Forest is part of the new release of CapyMOA, a machine learning library tailored for data streams. There you can find installation instructions, anomaly detection tutorials, and the Online Isolation Forest documentation.
[1]
If you find Online Isolation Forest useful in your scientific publication, we would appreciate it if you used the following citation:
@inproceedings{Leveni2024,
title = {Online Isolation Forest},
author = {Leveni, Filippo and Weigert Cassales, Guilherme and Pfahringer, Bernhard and Bifet, Albert and Boracchi, Giacomo},
booktitle = {Proceedings of the 41st International Conference on Machine Learning (ICML)},
volume = {235},
pages = {27288--27298},
year = {2024},
editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
series = {Proceedings of Machine Learning Research (PMLR)},
month = {21--27 Jul},
url = {https://proceedings.mlr.press/v235/leveni24a.html},
}