Add support of dynamic graphs and code for ROLAND. #23
base: dynamic
Changes from 61 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,8 @@ | ||
**/data_dir/ | ||
run/datasets/data/ | ||
run/results/ | ||
run/runs_*/ | ||
**/__pycache__/ | ||
**/.ipynb_checkpoints | ||
.idea/ | ||
.idea/ | ||
.vscode/settings.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# ROLAND: Graph Neural Networks for Dynamic Graphs | ||
This repository contains code associated with the ROLAND project and more. | ||
You can first walk through the *how-to* sections to run experiments on existing | ||
public datasets. | ||
After understanding how to run and analyze experiments, you can read through the *development topics* to run our | ||
|
||
|
||
## TODO: add figures to illustrate the ROLAND framework. | ||
|
||
## How to Download Datasets | ||
Most of the datasets used in our paper can be found at `https://snap.stanford.edu/data/index.html`. | ||
|
||
```bash | ||
# Or use your own dataset directory. | ||
mkdir ./all_datasets/ | ||
cd ./all_datasets | ||
wget 'https://snap.stanford.edu/data/soc-sign-bitcoinotc.csv.gz' | ||
wget 'https://snap.stanford.edu/data/soc-sign-bitcoinalpha.csv.gz' | ||
wget 'https://snap.stanford.edu/data/as-733.tar.gz' | ||
wget 'https://snap.stanford.edu/data/CollegeMsg.txt.gz' | ||
wget 'https://snap.stanford.edu/data/soc-redditHyperlinks-body.tsv' | ||
wget 'https://snap.stanford.edu/data/soc-redditHyperlinks-title.tsv' | ||
wget 'http://snap.stanford.edu/data/web-redditEmbeddings-subreddits.csv' | ||
|
||
# Unzip files | ||
gunzip CollegeMsg.txt.gz | ||
gunzip soc-sign-bitcoinalpha.csv.gz | ||
gunzip soc-sign-bitcoinotc.csv.gz | ||
tar xf ./as-733.tar.gz | ||
|
||
# Rename files, this step is required by our loader. | ||
# You can leave the web-redditEmbeddings-subreddits.csv file unchanged. | ||
mv ./soc-sign-bitcoinotc.csv ./bitcoinotc.csv | ||
mv ./soc-sign-bitcoinalpha.csv ./bitcoinalpha.csv | ||
|
||
mv ./soc-redditHyperlinks-body.tsv ./reddit-body.tsv | ||
mv ./soc-redditHyperlinks-title.tsv ./reddit-title.tsv | ||
``` | ||
Running `ls | wc -l` afterwards should report 740 files, including the archive `as-733.tar.gz`. | ||
The total disk space required is approximately 950MiB. | ||
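A quick way to confirm that the download and rename steps worked is to check for the file names produced by the commands above. The sketch below is illustrative only and is not part of the repository: the helper name `missing_files` is hypothetical, the file list is copied from the commands above, and the extracted AS-733 snapshot files are not checked.

```python
from pathlib import Path

# File names produced by the download, unzip and rename commands above.
EXPECTED_FILES = [
    "bitcoinotc.csv",
    "bitcoinalpha.csv",
    "CollegeMsg.txt",
    "reddit-body.tsv",
    "reddit-title.tsv",
    "web-redditEmbeddings-subreddits.csv",
    "as-733.tar.gz",
]


def missing_files(dataset_dir):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Assumes the directory name used in the commands above.
missing = missing_files("./all_datasets")
print("missing files:", missing if missing else "none")
```

An empty list means the seven named files are in place; the remaining entries counted by `ls | wc -l` are presumably the snapshot files extracted from `as-733.tar.gz`.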
## How to Run Single Experiments from Our Paper | ||
**WARNING**: for each `yaml` file in `./run/configs/ROLAND`, you need to set the `dataset.dir` field to the path of the datasets downloaded above. | ||
|
||
The ROLAND project focuses on link prediction for homogeneous dynamic graphs. | ||
Here we demonstrate example runs using | ||
|
||
To run the link-prediction task on the `CollegeMsg.txt` dataset with default settings: | ||
```bash | ||
cd ./run | ||
python3 main_dynamic.py --cfg configs/ROLAND/roland_gru_ucimsg.yaml --repeat 1 | ||
``` | ||
For other datasets: | ||
```bash | ||
python3 main_dynamic.py --cfg configs/ROLAND/roland_gru_btcalpha.yaml --repeat 1 | ||
|
||
python3 main_dynamic.py --cfg configs/ROLAND/roland_gru_btcotc.yaml --repeat 1 | ||
|
||
python3 main_dynamic.py --cfg configs/ROLAND/roland_gru_ucimsg.yaml --repeat 1 | ||
|
||
python3 main_dynamic.py --cfg configs/ROLAND/roland_gru_reddittitle.yaml --repeat 1 | ||
|
||
python3 main_dynamic.py --cfg configs/ROLAND/roland_gru_redditbody.yaml --repeat 1 | ||
``` | ||
The `--repeat` argument controls the number of random seeds used for each experiment. For example, setting `--repeat 3` runs each experiment three times with three different random seeds. | ||
|
||
To explore training results: | ||
```bash | ||
cd ./run | ||
tensorboard --logdir=./runs_live_update --port=6006 | ||
``` | ||
**WARNING**: The x-axis of plots in tensorboard shows snapshot IDs (e.g., the $i^{th}$ day or the $i^{th}$ week), **not** epochs. | ||
|
||
## Examples on Heterogeneous Graph Snapshots | ||
```bash | ||
Under development. | ||
``` | ||
|
||
## How to Run Grid Search / Batch Experiments | ||
To run grid search / batch experiments, you need a `main.py` file, a `base_config.yaml`, and a `grid.txt` file. The main and config files are the same as in the single experiment setup above. | ||
Suppose you want to run link prediction on the `CollegeMsg.txt` dataset with the configuration from `configs/ROLAND/roland_gru_ucimsg.yaml`, and you also want to try (1) *different numbers of GNN message passing layers* and (2) *different learning rates*. | ||
In this case, you can use the following grid file: | ||
```text | ||
# grid.txt, lines starting with # are comments. | ||
gnn.layers_mp mp [2,3,4,5] | ||
optim.base_lr lr [0.003,0.01,0.03] | ||
``` | ||
**WARNING**: the format of each line is crucial: `NAME_IN_YAML<space>SHORT_ALIAS<space>LIST_OF_VALUES`, and there should **not** be any space in the list of values. | ||
|
||
The `grid.txt` above generates $4\times 3=12$ different configurations by setting `gnn.layers_mp` and `optim.base_lr` in the base config file `roland_gru_ucimsg.yaml` to each combination of the listed values. | ||
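The expansion of a grid file into configurations can be sketched as a Cartesian product over the value lists. This is a minimal illustration of the line format described above, not the parser the repository actually uses; `parse_grid` and `expand_grid` are hypothetical names, and the value lists are assumed to be valid JSON.

```python
import itertools
import json


def parse_grid(text):
    """Parse lines of the form NAME_IN_YAML<space>SHORT_ALIAS<space>LIST_OF_VALUES."""
    grid = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Splitting on single spaces fails if the value list contains a space,
        # which is why the list must be written without spaces.
        name, alias, values = line.split(" ")
        grid.append((name, alias, json.loads(values)))
    return grid


def expand_grid(grid):
    """Return one {name: value} override dict per combination of values."""
    names = [name for name, _, _ in grid]
    value_lists = [values for _, _, values in grid]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


GRID_TXT = """\
# grid.txt, lines starting with # are comments.
gnn.layers_mp mp [2,3,4,5]
optim.base_lr lr [0.003,0.01,0.03]
"""

configs = expand_grid(parse_grid(GRID_TXT))
print(len(configs))  # number of generated configurations
print(configs[0])    # first set of overrides for the base config
```

Each dictionary holds the overrides that would be applied to the base config for one run.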
|
||
Please see `./run/grids/ROLAND/example_grid.txt` for a complete example of a grid search file. | ||
|
||
To run the experiment using `example_grid.txt`: | ||
```bash | ||
bash ./run_roland_batch.sh | ||
``` | ||
## How to Export Tensorboard Results to CSV | ||
We provide a simple script to aggregate results from a batch of tensorboard files. Feel free to look into `tabulate_events.py` and modify it. | ||
```bash | ||
# Usage: python3 ./tabulate_events.py <tensorboard_logdir> <output_file_name> | ||
python3 ./tabulate_events.py ./live_update ./out.csv | ||
``` | ||
|
||
## Development Topic: Use Your Own Dataset | ||
We provide two examples of constructing your own datasets. Please refer to | ||
(1) `./graphgym/contrib/loader/roland_template.py` and (2) `./graphgym/contrib/loader/roland_template_hetero.py` for examples of building loaders. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
from yacs.config import CfgNode as CN | ||
|
||
from graphgym.register import register_config | ||
|
||
|
||
def set_cfg_roland(cfg): | ||
""" | ||
This function sets the default config values for customized options. | ||
:return: customized configuration used by the experiment. | ||
""" | ||
|
||
# ----------------------------------------------------------------------- # | ||
# Customized options | ||
# ----------------------------------------------------------------------- # | ||
|
||
# Used to identify experiments; tensorboard logs will be saved to this path. | ||
# Options: any string. | ||
cfg.remark = '' | ||
|
||
# ----------------------------------------------------------------------- # | ||
# Additional GNN options. | ||
# ----------------------------------------------------------------------- # | ||
# Method to update node embedding from old node embedding and new node features. | ||
# Options: {'moving_average', 'mlp', 'gru'} | ||
cfg.gnn.embed_update_method = 'moving_average' | ||
|
||
# How many layers to use in the MLP updater. | ||
# Options: integers >= 1. | ||
# NOTE: there is a known issue when set to 1, use >= 2 for now. | ||
# Only effective when cfg.gnn.embed_update_method == 'mlp'. | ||
cfg.gnn.mlp_update_layers = 2 | ||
|
||
# What kind of skip-connection to use. | ||
# Options: {'none', 'identity', 'affine'}. | ||
cfg.gnn.skip_connection = 'none' | ||
|
||
# The batch size used when making link predictions. Useful when the number of | ||
# negative edges is huge; use a smaller number depending on GPU memory size. | ||
cfg.gnn.link_pred_batch_size = 500000 | ||
# ----------------------------------------------------------------------- # | ||
# Meta-Learning options. | ||
# ----------------------------------------------------------------------- # | ||
# For meta-learning. | ||
cfg.meta = CN() | ||
# Whether to do meta-learning via initialization moving average. | ||
# Options: {True, False} | ||
cfg.meta.is_meta = False | ||
|
||
# Weight used in moving average for model parameters. | ||
# After fine-tuning the model in period t to obtain model M[t], | ||
# set W_init = (1-alpha) * W_init + alpha * M[t]. | ||
# For the next period, use W_init as the initialization for fine-tuning. | ||
# Set cfg.meta.alpha = 1.0 to recover the original algorithm. | ||
# Options: float between 0.0 and 1.0. | ||
cfg.meta.alpha = 0.9 | ||
|
||
# ----------------------------------------------------------------------- # | ||
# Additional training options. | ||
# ----------------------------------------------------------------------- # | ||
# How many snapshots to use for truncated back-propagation through time. | ||
# Set to a very large integer to use full back-propagation through time. | ||
# Options: integers >= 1. | ||
cfg.train.tbptt_freq = 10 | ||
|
||
# Early stopping tolerance in live-update. | ||
# Options: integers >= 1. | ||
cfg.train.internal_validation_tolerance = 5 | ||
|
||
# Computing MRR is slow in the baseline setting. | ||
# Only start to compute MRR in the test set range after certain time. | ||
# Options: integers >= 0. | ||
cfg.train.start_compute_mrr = 0 | ||
|
||
# ----------------------------------------------------------------------- # | ||
# Additional dataset options. | ||
# ----------------------------------------------------------------------- # | ||
|
||
# How to handle node features in AS-733 dataset. | ||
# Options: ['one', 'one_hot_id', 'one_hot_degree_global'] | ||
cfg.dataset.AS_node_feature = 'one' | ||
|
||
# Method used to sample negative edges for edge_label_index. | ||
# Options: | ||
# 'uniform': all non-existing edges have same probability of being sampled | ||
# as negative edges. | ||
# 'src': non-existing edges from high-degree nodes are more likely to be | ||
# sampled as negative edges. | ||
# 'dest': non-existing edges pointed to high-degree nodes are more likely | ||
# to be sampled as negative edges. | ||
cfg.dataset.negative_sample_weight = 'uniform' | ||
|
||
# Whether to load dataset as heterogeneous graphs. | ||
# Options: {True, False}. | ||
cfg.dataset.is_hetero = False | ||
|
||
# whether to look for and load cached graph. By default (load_cache=False) | ||
# the loader loads the raw tsv file from disk and | ||
cfg.dataset.load_cache = False | ||
|
||
cfg.dataset.premade_datasets = 'fresh' | ||
|
||
cfg.dataset.include_node_features = False | ||
|
||
# 'chronological_temporal' or 'default'. | ||
# 'chronological_temporal': only for temporal graphs, for example, | ||
# the first 80% snapshots are for training, then subsequent 10% snapshots | ||
# are for validation and the last 10% snapshots are for testing. | ||
cfg.dataset.split_method = 'default' | ||
|
||
# In the case of live-update, whether to predict all edges at time t+1. | ||
cfg.dataset.link_pred_all_edges = False | ||
# ----------------------------------------------------------------------- # | ||
# Customized options: `transaction` for ROLAND dynamic graphs. | ||
# ----------------------------------------------------------------------- # | ||
|
||
# example argument group | ||
cfg.transaction = CN() | ||
|
||
# whether to use snapshots | ||
cfg.transaction.snapshot = False | ||
|
||
# snapshot split method 1: number of snapshots | ||
# split dataset into fixed number of snapshots. | ||
cfg.transaction.snapshot_num = 100 | ||
|
||
# snapshot split method 2: snapshot frequency | ||
# e.g., one snapshot contains transactions within 1 day. | ||
cfg.transaction.snapshot_freq = 'D' | ||
|
||
cfg.transaction.check_snapshot = False | ||
|
||
# how to use transaction history | ||
# full or rolling | ||
cfg.transaction.history = 'full' | ||
|
||
# type of loss: supervised / meta | ||
cfg.transaction.loss = 'meta' | ||
|
||
# feature dim for int edge features | ||
cfg.transaction.feature_int_dim = 32 | ||
cfg.transaction.feature_edge_int_num = [50, 8, 252, 252, 3, 3] | ||
cfg.transaction.feature_node_int_num = [0] | ||
|
||
# feature dim for amount (float) edge feature | ||
cfg.transaction.feature_amount_dim = 64 | ||
|
||
# feature dim for time (float) edge feature | ||
cfg.transaction.feature_time_dim = 64 | ||
|
||
# | ||
cfg.transaction.node_feature = 'raw' | ||
|
||
# how many days to look into the future | ||
cfg.transaction.horizon = 1 | ||
|
||
# prediction mode for the task; 'before' or 'after' | ||
cfg.transaction.pred_mode = 'before' | ||
|
||
# number of periods to be captured. | ||
# set to a list of integers if you wish to use pre-defined periodicity. | ||
# e.g., [1,7,28,31,...] etc. | ||
cfg.transaction.time_enc_periods = [1] | ||
|
||
# if 'enc_before_diff': attention weight = diff(enc(t1), enc(t2)) | ||
# if 'diff_before_enc': attention weight = enc(t1 - t2) | ||
cfg.transaction.time_enc_mode = 'enc_before_diff' | ||
|
||
# how to compute the keep ratio while updating the recurrent GNN. | ||
# the update ratio (for each node) is a function of its degree in [0, t) | ||
# and its degree in snapshot t. | ||
cfg.transaction.keep_ratio = 'linear' | ||
|
||
# ----------------------------------------------------------------------- # | ||
# Customized options: metrics. | ||
# ----------------------------------------------------------------------- # | ||
|
||
cfg.metric = CN() | ||
# How many negative edges to sample for each node when computing rank-based | ||
# evaluation metrics such as MRR and recall at K. | ||
# E.g., if mrr_num_negative_edges = 1000 and a node has 3 positive edges, then we | ||
# compute the MRR using 1000 randomly generated negative edges | ||
# + 3 existing positive edges. | ||
# Use 100 ~ 1000 for fast and reliable results. | ||
cfg.metric.mrr_num_negative_edges = 1000 | ||
Review comment: How can we set negative edges as all the edges?
Reply: We can set |
||
|
||
# how to compute MRR. | ||
# available: f = 'min', 'max', 'mean'. | ||
# Step 1: get the p* = f(scores of positive edges) | ||
# Step 2: compute the rank r of p* among all negative edges. | ||
# Step 3: RR = 1 / rank. | ||
# Step 4: average over all users. | ||
# expected MRR(min) <= MRR(mean) <= MRR(max). | ||
cfg.metric.mrr_method = 'max' | ||
|
||
|
||
register_config('roland', set_cfg_roland) |
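Two of the options documented above describe small formulas: the `cfg.meta.alpha` moving average of model weights and the `cfg.metric.mrr_method` reciprocal rank. The sketch below works through both on plain Python lists. It is an illustration under stated assumptions, not the code from this PR: the function names are hypothetical, weights are flat lists of floats rather than model parameters, and the rank is assumed to be one plus the number of negative edges scoring strictly higher than p*.

```python
from statistics import mean


def update_init(w_init, w_tuned, alpha):
    """W_init = (1 - alpha) * W_init + alpha * M[t], applied element-wise."""
    return [(1 - alpha) * wi + alpha * wt for wi, wt in zip(w_init, w_tuned)]


def reciprocal_rank(pos_scores, neg_scores, method="max"):
    """Steps 1-3 from the comment block: reduce, rank, invert."""
    reducers = {"min": min, "max": max, "mean": mean}
    p_star = reducers[method](pos_scores)                # Step 1
    # Assumption: ties are resolved in favour of the positive edge.
    rank = 1 + sum(1 for s in neg_scores if s > p_star)  # Step 2
    return 1.0 / rank                                    # Step 3


def mrr(per_node_scores, method="max"):
    """Step 4: average the reciprocal ranks over all nodes."""
    return mean(reciprocal_rank(p, n, method) for p, n in per_node_scores)


# alpha = 1.0 discards the old initialization and keeps only M[t].
print(update_init([0.0, 2.0], [1.0, 4.0], alpha=1.0))

# One node with two positive edges and three sampled negative edges.
pos, neg = [0.9, 0.2], [0.95, 0.5, 0.1]
for method in ("min", "mean", "max"):
    print(method, reciprocal_rank(pos, neg, method))
```

With these scores the three methods satisfy the ordering stated in the comments, MRR(min) <= MRR(mean) <= MRR(max).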
You can start writing the example, without worrying about the heterogeneous layer yet.