CausalFlow: a Collection of Methods for Causal Discovery from Time-series

CausalFlow is a Python library for causal analysis from time-series data. It comprises:

  • F-PCMCI - Filtered-PCMCI
  • CAnDOIT - CAusal Discovery with Observational and Interventional data from Time-series
  • RandomGraph
  • Other causal discovery methods all within the same framework

Useful links

  • F-PCMCI:
    L. Castri, S. Mghames, M. Hanheide and N. Bellotto (2023).
    Enhancing Causal Discovery from Robot Sensor Data in Dynamic Scenarios,
    Proceedings of the Conference on Causal Learning and Reasoning (CLeaR).
    @inproceedings{castri2023enhancing,
      title={Enhancing Causal Discovery from Robot Sensor Data in Dynamic Scenarios},
      author={Castri, Luca and Mghames, Sariah and Hanheide, Marc and Bellotto, Nicola},
      booktitle={Conference on Causal Learning and Reasoning},
      pages={243--258},
      year={2023},
      organization={PMLR}
    }
    
  • CAnDOIT:
    L. Castri, S. Mghames, M. Hanheide and N. Bellotto (2024).
    CAnDOIT: Causal Discovery with Observational and Interventional Data from Time-Series,
    Advanced Intelligent Systems.
    @article{https://doi.org/10.1002/aisy.202400181,
      author = {Castri, Luca and Mghames, Sariah and Hanheide, Marc and Bellotto, Nicola},
      title = {CAnDOIT: Causal Discovery with Observational and Interventional Data from Time Series},
      journal = {Advanced Intelligent Systems},
      volume = {n/a},
      number = {n/a},
      pages = {2400181},
      keywords = {causal robotics, observations and interventions-based causal discoveries, time series},
      doi = {https://doi.org/10.1002/aisy.202400181},
      url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/aisy.202400181},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/aisy.202400181},
    }
    
  • Tutorials [Coming soon..]

F-PCMCI

F-PCMCI extends the state-of-the-art causal discovery method PCMCI with a feature-selection step based on Transfer Entropy. Starting from a predefined set of variables, the algorithm identifies the relevant subset of features and a hypothetical causal model between them. Causal discovery is then executed on the selected features, using the hypothetical causal model as the list of candidate links. This reduced set of variables and candidate links makes the causal discovery faster and more accurate.
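
To give an intuition for the Transfer Entropy based selection step, here is a minimal NumPy sketch. It is not the estimator used by F-PCMCI (which relies on IDTxl); it assumes linear-Gaussian data, in which case transfer entropy reduces to half the log-ratio of two regression residual variances. The function name `gaussian_transfer_entropy` and the toy variables `x`, `y`, `z` are hypothetical.

```python
import numpy as np

def gaussian_transfer_entropy(source, target, lag=1):
    """Estimate TE(source -> target) under a linear-Gaussian assumption."""
    y_now = target[lag:]
    y_past = target[:-lag]
    x_past = source[:-lag]

    def residual_variance(regressors):
        # Least-squares fit with intercept; return the variance of the residuals
        design = np.column_stack([np.ones(len(y_now))] + regressors)
        coef, *_ = np.linalg.lstsq(design, y_now, rcond=None)
        return np.var(y_now - design @ coef)

    restricted = residual_variance([y_past])       # target's own past only
    full = residual_variance([y_past, x_past])     # own past plus source's past
    return 0.5 * np.log(restricted / full)

# Toy data (assumption): x drives y with lag 1, z is unrelated noise
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + rng.normal(size=n - 1)

te_xy = gaussian_transfer_entropy(x, y)
te_zy = gaussian_transfer_entropy(z, y)
print(f"TE(x -> y) = {te_xy:.3f}, TE(z -> y) = {te_zy:.3f}")
```

A variable such as `z`, whose transfer entropy to and from every other variable is negligible, is the kind of isolated feature that a selection step can discard before causal discovery.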

The following example demonstrates the main functionality of F-PCMCI and compares the causal models obtained by the PCMCI and F-PCMCI causal discovery algorithms from the same data. The dataset consists of a 7-variable system defined as follows:

$$ \begin{aligned} X_0(t) &= 2X_1(t-1) + 3X_3(t-1) + \eta_0 \\ X_1(t) &= \eta_1 \\ X_2(t) &= 1.1(X_1(t-1))^2 + \eta_2 \\ X_3(t) &= X_3(t-1) \cdot X_2(t-1) + \eta_3 \\ X_4(t) &= X_4(t-1) + X_5(t-1) \cdot X_0(t-1) + \eta_4 \\ X_5(t) &= \eta_5 \\ X_6(t) &= \eta_6 \end{aligned} $$

import numpy as np

min_lag = 1
max_lag = 1
np.random.seed(1)
nsample = 1500
nfeature = 7

# Start from uniform noise, then add the lagged dependencies
d = np.random.random(size = (nsample, nfeature))
for t in range(max_lag, nsample):
  d[t, 0] += 2 * d[t-1, 1] + 3 * d[t-1, 3]
  d[t, 2] += 1.1 * d[t-1, 1]**2
  d[t, 3] += d[t-1, 3] * d[t-1, 2]
  d[t, 4] += d[t-1, 4] + d[t-1, 5] * d[t-1, 0]

Figures: Causal Model by PCMCI (execution time ~ 8min 40sec); Causal Model by F-PCMCI (execution time ~ 3min 00sec).

F-PCMCI removes the variable $X_6$ from the causal graph (since it is isolated) and generates the correct causal model. In contrast, PCMCI retains $X_6$, leading to the wrong causal structure. Specifically, a spurious link $X_6$ -> $X_5$ appears in the causal graph derived by PCMCI.

CAnDOIT

CAnDOIT extends LPCMCI, allowing the incorporation of interventional data into the causal discovery process alongside observational data. Like its predecessor, CAnDOIT can handle both lagged and contemporaneous dependencies, as well as latent variables.
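
To illustrate why latent variables matter, the following minimal NumPy sketch (independent of CausalFlow) builds a hypothetical toy system in which a hidden variable `latent` drives `a` contemporaneously and `b` with lag 1. The variable names and coefficients are assumptions chosen for illustration. The two observed series are correlated although neither causes the other, and the correlation vanishes once the hidden variable is accounted for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

latent = rng.normal(size=n)                           # hidden confounder L
a = 0.6 * latent + rng.normal(size=n)                 # A(t) depends on L(t)
b = np.zeros(n)
b[1:] = 0.4 * latent[:-1] + rng.normal(size=n - 1)    # B(t) depends on L(t-1)

def residual(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Align A(t-1), L(t-1) with B(t)
a_past, l_past, b_now = a[:-1], latent[:-1], b[1:]

corr_observed = np.corrcoef(a_past, b_now)[0, 1]
corr_given_latent = np.corrcoef(residual(a_past, l_past),
                                residual(b_now, l_past))[0, 1]

print(f"corr(A(t-1), B(t))     = {corr_observed:.3f}")
print(f"same, after removing L = {corr_given_latent:.3f}")
```

When the confounder is not observed, only the first number is available, which is why methods handling latent variables report such dependencies as bidirected links rather than as direct causes.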

Example

In the following example, taken from one of the tigramite tutorials, we demonstrate CAnDOIT's ability to incorporate and leverage interventional data to improve the accuracy of causal analysis. The example involves a system of equations with four variables:

$$ \begin{aligned} X_0(t) &= 0.9X_0(t-1) + 0.6L_1(t) + \eta_0 \\ L_1(t) &= \eta_1 \\ X_2(t) &= 0.9X_2(t-1) + 0.4L_1(t-1) + \eta_2 \\ X_3(t) &= 0.9X_3(t-1) - 0.5X_2(t-2) + \eta_3 \end{aligned} $$

Note that $L_1$ is a latent confounder of $X_0$ and $X_2$. This system of equations generates the time-series data in the observational domain, which is then used by LPCMCI for causal discovery analysis.

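import numpy as np
from tigramite.independence_tests.parcorr import ParCorr
# Assumed CausalFlow module paths; adjust to your installed version
from causalflow.preprocessing.data import Data
from causalflow.causal_discovery.baseline.LPCMCI import LPCMCI

tau_max = 2
pc_alpha = 0.05
np.random.seed(19)
nsample_obs = 500
nfeature = 4

# Start from uniform noise; column 1 plays the role of the latent L_1
d = np.random.random(size = (nsample_obs, nfeature))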
for t in range(tau_max, nsample_obs):
  d[t, 0] += 0.9 * d[t-1, 0] + 0.6 * d[t, 1]
  d[t, 2] += 0.9 * d[t-1, 2] + 0.4 * d[t-1, 1]
  d[t, 3] += 0.9 * d[t-1, 3] - 0.5 * d[t-2, 2]

# Remove the unobserved component time series
data_obs = d[:, [0, 2, 3]]

var_names = ['X_0', 'X_2', 'X_3']
d_obs = Data(data_obs, vars = var_names)
d_obs.plot_timeseries()

lpcmci = LPCMCI(d_obs,
                min_lag = 0,
                max_lag = tau_max,
                val_condtest = ParCorr(significance='analytic'),
                alpha = pc_alpha)

# Run LPCMCI
lpcmci_cm = lpcmci.run()
lpcmci_cm.ts_dag(node_size = 4, min_width = 1.5, max_width = 1.5, 
                 x_disp=0.5, y_disp=0.2, font_size=10)

Figures: Observational Data; Causal Model by LPCMCI.

As you can see from LPCMCI's result, the method correctly identifies the bidirected link (indicating the presence of a latent confounder) between $X_0$ and $X_2$. However, the final causal model presents uncertainty regarding the link $X_2$ o-> $X_3$. Specifically, the final causal model is a PAG that represents two MAGs: the first with $X_2$ <-> $X_3$, and the second with $X_2$ -> $X_3$.
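
The relation between a PAG edge and the fully oriented edges it stands for can be sketched in a few lines of Python. This is only an illustration of the edge-mark semantics, not part of CausalFlow or LPCMCI. It assumes three-character edge strings in the tigramite style (`o->`, `o-o`, `-->`, `<->`), assumes no selection bias (so undirected edges are excluded), and ignores the global constraints a valid MAG must also satisfy. The function name `expand_edge` is hypothetical.

```python
from itertools import product

def expand_edge(edge):
    """List the fully oriented edges compatible with one PAG edge string.

    Marks: 'o' = undetermined (circle), '<' or '>' = arrowhead, '-' = tail.
    """
    left, right = edge[0], edge[-1]
    left_options = ["<", "-"] if left == "o" else [left]
    right_options = [">", "-"] if right == "o" else [right]

    oriented = []
    for l_mark, r_mark in product(left_options, right_options):
        candidate = f"{l_mark}-{r_mark}"
        if candidate == "---":
            # Undirected edge: excluded under the no-selection-bias assumption
            continue
        oriented.append(candidate)
    return oriented

print(expand_edge("o->"))   # the two readings of X_2 o-> X_3
print(expand_edge("o-o"))
print(expand_edge("<->"))
```

For `o->` the sketch returns the bidirected edge and the directed edge, which are exactly the two alternatives described above for the link between $X_2$ and $X_3$.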

Now, let's introduce interventional data and examine its benefits. In this case, we perform a hard intervention on the variable $X_2$, meaning we replace its equation with a constant value corresponding to the intervention (in this case, $X_2 = 3$).

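# Assumed CausalFlow module path; adjust to your installed version
from causalflow.causal_discovery.CAnDOIT import CAnDOIT

nsample_int = 150
int_data = dict()

# Intervention on X_2: start from noise and reuse the last tau_max
# observational samples as initial conditions
d_int = np.random.random(size = (nsample_int, nfeature))
d_int[0:tau_max, :] = d[len(d)-tau_max:,:]
# Hard intervention: clamp X_2 to the constant value 3
d_int[:, 2] = 3 * np.ones(shape = (nsample_int,))
for t in range(tau_max, nsample_int):
    d_int[t, 0] += 0.9 * d_int[t-1, 0] + 0.6 * d_int[t, 1]
    d_int[t, 3] += 0.9 * d_int[t-1, 3] - 0.5 * d_int[t-2, 2]

# Remove the unobserved component, as done for the observational data
data_int = d_int[:, [0, 2, 3]]
df_int = Data(data_int, vars = var_names)
int_data['X_2'] = df_int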

candoit = CAnDOIT(d_obs, 
                  int_data,
                  alpha = pc_alpha, 
                  min_lag = 0, 
                  max_lag = tau_max, 
                  val_condtest = ParCorr(significance='analytic'))
    
candoit_cm = candoit.run()
candoit_cm.ts_dag(node_size = 4, min_width = 1.5, max_width = 1.5, 
                  x_disp=0.5, y_disp=0.2, font_size=10)

Figures: Observational & Interventional Data; Causal Model by CAnDOIT.

CAnDOIT, like LPCMCI, correctly detects the bidirected link $X_0$ <-> $X_2$. Additionally, by incorporating interventional data, CAnDOIT resolves the uncertainty regarding the link $X_2$ o-> $X_3$, resulting in a reduction of the PAG size. Specifically, the PAG found by CAnDOIT represents only one MAG.

Robotics application of CAnDOIT

In this section, we discuss an application of CAnDOIT in a robotic scenario. We designed an experiment to learn the causal model in a hypothetical robot arm application equipped with a camera. For this application, we utilised Causal World, which models a TriFinger robot, a floor, and a stage.

In our case, we use only one finger of the robot, with the finger's end effector equipped with a camera. The scenario consists of a cube placed at the centre of the floor, surrounded by a white stage. The colour brightness ($b$) of the cube and the floor is modelled as a function of the end-effector height ($H$), its absolute velocity ($v$), and the distance between the end effector and the cube ($d_c$). This model captures the shading and blurring effects on the cube. In contrast, the floor, being darker and larger than the cube, is only affected by the end effector's height.

Note that $H$, $v$, and $d_c$ are obtained directly from the simulator and not explicitly modelled, while the ground-truth structural causal model for the floor colour ($F_c$) and cube colour ($C_c$) is expressed as follows:

$$ \begin{aligned} F_c(t) &= b(H(t-1))\\ C_c(t) &= b(H(t-1), v(t-1), d_c(t-1)) \end{aligned} $$

This model is used to generate observational data, which is then used by LPCMCI and CAnDOIT to reconstruct the causal model. For the interventional domain, we instead replace the equation modelling $F_c$ with a constant colour (green) and collect the data for the causal analysis conducted by CAnDOIT. Note that, for both the observational and interventional domains, $H$ is treated as a latent confounder between $F_c$ and $C_c$.

Figures: Observational dataset; Interventional dataset; Ground-truth Causal Model; Causal Model by LPCMCI; Causal Model by CAnDOIT.

In this experiment too, we can see the benefit of using interventional data alongside the observations. LPCMCI is unable to orient the contemporaneous (spurious) link between $F_c$ and $C_c$ due to the hidden confounder $H$. This results in the ambiguous link $F_c$ o-o $C_c$, which does not encode the correct link <->. CAnDOIT instead, using interventional data, correctly identifies the bidirected link $F_c$ <-> $C_c$, once again decreasing the uncertainty level and increasing the accuracy of the reconstructed causal model.

RandomGraph

RandomGraph is a random-model generator capable of creating random systems of equations with various properties: linear, nonlinear, lagged and/or contemporaneous dependencies, and hidden confounders. This tool offers several adjustable parameters, listed as follows:

  • time-series length;
  • number of observable variables;
  • number of observable parents per variable (link density);
  • number of hidden confounders;
  • number of confounded variables per hidden confounder;
  • noise configuration, e.g. Gaussian noise $\mathcal{N}(\mu, \sigma^2)$;
  • minimum $\tau_{min}$ and maximum $\tau_{max}$ time delay to consider in the equations;
  • coefficient range of the equations' terms;
  • functional forms applied to the equations' terms: $[-, \sin, \cos, \text{abs}, \text{pow}, \text{exp}]$, where $-$ stands for none;
  • operators used to link the equations' terms: $[+, -, *, /]$.

RandomGraph outputs a graph, the associated system of equations, and observational data. Additionally, it provides the option to generate interventional data.
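
To make the role of these parameters concrete, here is a minimal NumPy sketch of a random linear, lagged-only system generator. It is not RandomGraph's implementation: the function names `random_linear_system` and `simulate` are hypothetical, hidden confounders, contemporaneous links and nonlinear terms are left out, and the coefficient range is chosen (as an assumption) so that the absolute coefficients of each equation sum to less than 1, which keeps the simulated series bounded.

```python
import numpy as np

def random_linear_system(nvars, max_lag, n_parents, coeff_range, rng):
    """Draw random lagged parents and signed coefficients for each variable."""
    equations = {}
    for target in range(nvars):
        terms = []
        for _ in range(n_parents):
            parent = int(rng.integers(nvars))
            lag = int(rng.integers(1, max_lag + 1))        # lagged links only
            sign = rng.choice([-1, 1])
            coeff = float(sign * rng.uniform(*coeff_range))
            terms.append((parent, lag, coeff))
        equations[target] = terms
    return equations

def simulate(equations, nsamples, max_lag, rng):
    """Generate observational data: Gaussian noise plus the lagged terms."""
    nvars = len(equations)
    data = rng.normal(size=(nsamples, nvars))
    for t in range(max_lag, nsamples):
        for target, terms in equations.items():
            for parent, lag, coeff in terms:
                data[t, target] += coeff * data[t - lag, parent]
    return data

rng = np.random.default_rng(7)
equations = random_linear_system(nvars=5, max_lag=3, n_parents=2,
                                 coeff_range=(0.1, 0.4), rng=rng)
data = simulate(equations, nsamples=1000, max_lag=3, rng=rng)

for target, terms in equations.items():
    rhs = " ".join(f"{c:+.2f}*X_{p}(t-{l})" for p, l, c in terms)
    print(f"X_{target}(t) = {rhs} + noise")
```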

Example - Linear Random Graph

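import random
# Assumed CausalFlow module path; adjust to your installed version
from causalflow.random_system.RandomGraph import NoiseType, RandomGraph

noise_uniform = (NoiseType.Uniform, -0.5, 0.5)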
noise_gaussian = (NoiseType.Gaussian, 0, 1)
noise_weibull = (NoiseType.Weibull, 2, 1)
RG = RandomGraph(nvars = 5, 
                 nsamples = 1000, 
                 link_density = 3, 
                 coeff_range = (0.1, 0.5), 
                 max_exp = 2, 
                 min_lag = 0, 
                 max_lag = 3, 
                 noise_config = random.choice([noise_uniform, noise_gaussian, noise_weibull]),
                 functions = [''], 
                 operators = ['+', '-'], 
                 n_hidden_confounders = 2)
RG.gen_equations()
RG.ts_dag(withHidden = True)

$$ \begin{aligned} X_0(t)&=0.44X_1(t-1) - 0.15X_0(t-2) + 0.1X_4(t-3) + 0.33H_0(t-3) - 0.11H_1(t-2)\\ X_1(t)&=0.13X_2(t-2) + 0.19H_0(t-3) + 0.46H_1(t-3)\\ X_2(t)&=0.21X_4(t-3) + 0.37H_1(t)\\ X_3(t)&=0.23X_0(t-2) - 0.44H_0(t-3) - 0.17H_1(t-3)\\ X_4(t)&=0.47X_1(t-2) + 0.23X_0(t-3) + 0.49X_4(t-1) + 0.49H_0(t-3) - 0.27H_1(t-2)\\ H_0(t)&=0.1X_4(t-2)\\ H_1(t)&=0.44X_3(t)\\ \end{aligned} $$

Example - Nonlinear Random Graph

noise_uniform = (NoiseType.Uniform, -0.5, 0.5)
noise_gaussian = (NoiseType.Gaussian, 0, 1)
noise_weibull = (NoiseType.Weibull, 2, 1)
RG = RandomGraph(nvars = 5, 
                 nsamples = 1000, 
                 link_density = 3, 
                 coeff_range = (0.1, 0.5), 
                 max_exp = 2, 
                 min_lag = 0, 
                 max_lag = 3, 
                 noise_config = random.choice([noise_uniform, noise_gaussian, noise_weibull]),
                 functions = ['','sin', 'cos', 'exp', 'abs', 'pow'], 
                 operators = ['+', '-', '*', '/'], 
                 n_hidden_confounders = 2)
RG.gen_equations()
RG.ts_dag(withHidden = True)

$$ \begin{aligned} X_0(t)&=0.48\frac{\cos(X_4)(t-3)}{0.12\sin(H_1)(t-3)}\\ X_1(t)&=0.17\sin(X_4)(t-3) - 0.46\cos(X_3)(t) + 0.14|H_1|(t-3)\\ X_2(t)&=\frac{0.32X_4(t-1)}{0.2X_2(t-1)} + 0.23|X_3|(t-2) - 0.34e^{H_1}(t-3)\\ X_3(t)&=0.1|X_1|(t-1) \cdot 0.26\sin(X_2)(t) \cdot 0.4\cos(X_0)(t-2) - 0.2\cos(H_0)(t-2)\\ X_4(t)&=0.24|X_1|(t-3) - 0.43X_3(t) + 0.31\sin(H_0)(t-3) + 0.21H_1(t-3)\\ H_0(t)&=0.45|X_3|(t-2)\\ H_1(t)&=\frac{0.32H_0(t-1)}{0.35e^{H_1}(t-3)} \cdot 0.4X_4(t-3)\\ \end{aligned} $$

Figures:
  • Linear Random Graph: linear model, lagged dependencies, contemporaneous dependencies, 2 hidden confounders
  • Nonlinear Random Graph: nonlinear model, lagged dependencies, contemporaneous dependencies, 2 hidden confounders

Example - Random Graph with Interventional Data

noise_gaussian = (NoiseType.Gaussian, 0, 1)
RS = RandomGraph(nvars = 5, 
                 nsamples = 1500, 
                 link_density = 3, 
                 coeff_range = (0.1, 0.5), 
                 max_exp = 2, 
                 min_lag = 0, 
                 max_lag = 3, 
                 noise_config = noise_gaussian,
                 functions = ['','sin', 'cos', 'exp', 'abs', 'pow'], 
                 operators = ['+', '-', '*', '/'], 
                 n_hidden_confounders = 2)
RS.gen_equations()

d_obs_wH, d_obs = RS.gen_obs_ts()
d_obs.plot_timeseries()

d_int = RS.intervene('X_4', 250, random.uniform(5, 10), d_obs.d)
d_int['X_4'].plot_timeseries()

Figures: Observational Data; Interventional Data.

Other Causal Discovery Algorithms

Although the main contribution of this repository is to present the CAnDOIT and F-PCMCI algorithms, other causal discovery methods have been included for benchmarking purposes. Consequently, CausalFlow offers a collection of causal discovery methods, beyond F-PCMCI and CAnDOIT, that output time-series graphs (graphs that specify the lag for each link).

Some algorithms are imported from other languages, such as R and Java, and wrapped in Python. Integrating most of these methods into a single framework, which handles various types of inputs and outputs causal models, makes them easier to use. The methods are listed as follows:

Algorithm Observations Feature Selection Interventions
DYNOTEARS
PCMCI
PCMCI+
LPCMCI
J-PCMCI+
TCDF
tsFCI
VarLiNGAM
F-PCMCI
CAnDOIT

Citation

Please consider citing the papers listed under Useful links above, depending on which method you use.

Requirements

  • pandas>=1.5.2
  • numba>=0.58.1
  • scipy>=1.3.3
  • networkx>=2.8.6
  • ruptures>=1.1.7
  • scikit_learn>=1.1.3
  • torch>=1.11.0
  • gpytorch>=1.4
  • dcor>=0.5.3
  • h5py>=3.7.0
  • jpype1>=1.5.0
  • mpmath>=1.3.0
  • causalnex
  • lingam
  • pyopencl>=2024.1
  • matplotlib>=3.7.0
  • numpy
  • pgmpy>=0.1.19
  • tigramite>=5.1.0.3
  • rectangle-packer
  • grandalf
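
A quick way to compare an existing environment against these minimum versions is sketched below using only the standard library. The helper names `parse_version`, `meets_minimum` and `check_requirement` are hypothetical, and the version parsing is deliberately simple (numeric components only), so treat the result as a hint rather than a full dependency resolution.

```python
import re
from importlib import metadata

def parse_version(text):
    """Turn '1.5.2' into (1, 5, 2); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_requirement(spec):
    """Return (name, installed_version_or_None, ok) for a 'name>=x.y.z' spec."""
    name, _, minimum = spec.partition(">=")
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return name, None, False
    return name, installed, (not minimum) or meets_minimum(installed, minimum)

# A few entries from the list above; extend as needed
for spec in ["pandas>=1.5.2", "numba>=0.58.1", "tigramite>=5.1.0.3", "numpy"]:
    print(check_requirement(spec))
```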

Installation

Before installing CausalFlow, you need to install Java and the IDTxl package used for the feature-selection process (the complete procedure is given in the steps below). Once complete, you can install the current release of CausalFlow with:

pip install py-causalflow

For a complete installation (Java, IDTxl, CausalFlow), follow the procedure below.

1 - Java installation

Verify that you have not already installed Java:

java -version

If the command returns Command 'java' not found, ..., install Java with the following commands; otherwise, skip to the IDTxl installation.

# Java
sudo apt-get update
sudo apt install default-jdk

Then, you need to add JAVA_HOME to the environment

sudo nano /etc/environment
# Paste the following JAVA_HOME assignment at the bottom of the file, then save and exit
JAVA_HOME="/lib/jvm/java-11-openjdk-amd64/bin/java"
source /etc/environment
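
Before moving on, the Java setup can be checked from Python with a small standard-library sketch. The function name `java_status` is hypothetical, and the check only reports whether a `java` executable is on the PATH and what `JAVA_HOME` is set to in the current process; it does not validate the Java version.

```python
import os
import shutil

def java_status():
    """Report whether 'java' is on the PATH and the current JAVA_HOME value."""
    return {
        "java_on_path": shutil.which("java") is not None,
        "java_home": os.environ.get("JAVA_HOME"),  # None if the variable is unset
    }

status = java_status()
print(status)
```

If `java_home` is reported as None in a new shell, the environment file has probably not been reloaded yet.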

2 - IDTxl installation

# IDTxl
git clone -b v1.4 https://github.com/pwollstadt/IDTxl.git
cd IDTxl
pip install -e .

3 - CausalFlow installation

pip install py-causalflow

Recent changes

  • 4.0.4: IDTxl v1.4
  • 4.0.3: numba version fix; DAG dag() fix; CAnDOIT fix: min_lag must be equal to 0
  • 4.0.2: PyPI fixes; rectangle-packer and grandalf added to requirements; numba version fix; causal_discovery/baseline/pkgs fix
  • 4.0.1: PyPI
  • 4.0.0: package published