Hi everyone,
Here is a bit more work on the idea of using a recurrent neural network to simulate the timeseries produced by Kato et al. (once again, inspired by Karpathy's char-rnn).
Previously, I constructed an LSTM recurrent neural network to generate timeseries data, using the Kato data as input to train the model. @jrieke helpfully suggested using a Gaussian Mixture Model (GMM) to keep the LSTM from getting stuck in a local minimum. My last commit attempted to construct such a model, but it ran into numerical errors and would not converge.
I did additional work on the GMM, but training still produced only NaNs. I concluded that while a GMM is likely the best implementation, debugging it further is over my head at this point. I welcome any feedback on how to avoid the numerical errors.
Therefore, I returned to the simpler, "vanilla" LSTM, which I now share in this commit.
The current model, which can be found in the function `generate_lstm_vanilla()`, is a two-layer LSTM. It takes any real-valued sequence and attempts to predict the next values in the sequence. The model has two dropout layers, in which a random 50% of input units are set to zero at each update, to avoid overfitting (and hopefully to introduce some randomness that helps avoid local minima, although I am not sure this is a valid assumption).
Combining Kato's original data, the analytics toolset created by @theideasmith and @lukeczapla, the off-the-shelf numerical differentiation package previously added, and the RNN described above, we now have what I believe to be an interesting reconstruction of the "Kato pipeline", further described below.
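For anyone who wants to see the training setup without reading the model code, here is a minimal NumPy sketch of the two ideas described for `generate_lstm_vanilla()`: turning a real-valued sequence into (window, next value) pairs, and zeroing a random 50% of units. This is not the code in the commit. The helper names `make_windows` and `dropout`, the window length of 20, and the sine-wave input are all assumptions for illustration, and the rescaling by `1 / (1 - rate)` is the common "inverted dropout" convention rather than something stated above.

```python
import numpy as np


def make_windows(series, window):
    """Split a 1-D series into (input window, next value) training pairs."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window
    # Row i holds series[i : i + window]; its target is series[i + window].
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window:]
    return X, y


def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


# Hypothetical stand-in for one trace; the real input would be the Kato data.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 4 * np.pi, 200))

X, y = make_windows(series, window=20)
X_dropped = dropout(X, rate=0.5, rng=rng)
```

A fresh mask is drawn on every call, so each update sees a different random half of the inputs, which is the source of the randomness mentioned above.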
Kato's analysis can be thought of in three steps: (1) collecting raw data, (2) performing numerical differentiation on the raw data, and (3) performing PCA on the differentiated data and plotting the result.
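Steps (2) and (3) can be sketched in a few lines of NumPy. This is a simplified illustration, not the notebook code: it uses plain finite differences (`np.gradient`) as a stand-in for the Chartrand-style differentiation used in the pipeline, computes PCA through an SVD of the mean-centered data, and runs on synthetic sine traces rather than Kato's recordings. The function names and the (time points x neurons) array layout are assumptions.

```python
import numpy as np


def differentiate(raw, dt=1.0):
    """Finite-difference time derivative along axis 0 (rows are time points).

    Stand-in only: the pipeline itself uses a regularized method.
    """
    return np.gradient(raw, dt, axis=0)


def pca(data, n_components=3):
    """PCA via SVD of the mean-centered data.

    Returns the projection onto the leading components and the fraction
    of variance each of those components explains.
    """
    centered = data - data.mean(axis=0)
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ Vt[:n_components].T
    explained = S ** 2 / np.sum(S ** 2)
    return projected, explained[:n_components]


# Step 1 (synthetic stand-in for raw data): 8 phase-shifted noisy sine traces.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
raw = np.column_stack([np.sin(t + p) for p in np.linspace(0, np.pi, 8)])
raw += 0.01 * rng.standard_normal(raw.shape)

# Step 2: differentiate. Step 3: PCA on the derivatives.
deriv = differentiate(raw, dt=t[1] - t[0])
projected, explained = pca(deriv, n_components=3)
```

Plotting the columns of `projected` against each other gives the kind of phase plot shown in the third column of the graph. Plain finite differences amplify noise, which is one reason a regularized method is preferable on real recordings.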
In the graph shown and described below, each row represents a pass through these 3 steps.
The graph shown is generated from the notebook KatoPipeline.ipynb.
Cell (1,1) plots Kato's raw data, as supplied by Kato. Cell (1,2) plots the differentiated data as supplied by Kato. Cell (1,3) plots our calculation of PCA on the differentiated data supplied by Kato.
Cell (2,1) again shows Kato's raw data, as supplied by Kato. Cell (2,2) plots our numerical differentiation of the raw data supplied by Kato. The differentiation is achieved using an off-the-shelf implementation of Chartrand et al., as cited by Kato. Cell (2,3) plots our calculation of PCA on our differentiation of Kato's raw data.
Cell (3,1) shows data we generated using our RNN. Cell (3,2) plots our numerical differentiation of our RNN-generated data. Cell (3,3) plots our calculation of PCA on our differentiation of our RNN-generated data.
In this way, we re-create Kato's analysis pipeline using both Kato's data and our own simulated data! My hope is that this plug-and-play pipeline will be useful for analyzing the output of future simulations.
@theideasmith, could we find a time soon to talk through the above and review the visualizations you put together over the past few weeks?
I look forward to everyone's comments.