The Yelp dataset is processed from the publicly available Yelp Open Dataset. Details for the Yelp dataset are as follows:
- Node: One node in the Yelp dataset represents an active user. We recursively remove inactive users (users with no friends or no reviews) until every remaining user has at least one friend in the dataset.
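The recursive removal above can be sketched as an iterative pruning loop. The dict-of-sets adjacency and the `has_review` flag below are hypothetical stand-ins for illustration, not the actual preprocessing code:

```python
# Sketch of the recursive pruning step, assuming a symmetric
# dict-of-sets friendship graph (hypothetical representation).
def prune_inactive(adj, has_review):
    """Repeatedly drop users with no friends or no reviews until stable."""
    adj = {u: set(vs) for u, vs in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if u in adj and (not adj[u] or not has_review.get(u, False)):
                for v in adj.pop(u):     # remove u and its incident edges
                    if v in adj:
                        adj[v].discard(u)
                changed = True
    return adj
```

Note that the outer `while` loop matters: removing one user can leave a neighbor friendless, so pruning must repeat until no node changes.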
- Edge: We add an edge between two nodes A and B if user B is in the friend list of user A, or vice versa.
- Node feature: The node feature summarizes all the reviews written by the user. To generate node features, we first split all of a user's reviews into words. Then we use the pre-trained Word2Vec model from GoogleNews to convert each word into a 300-dimensional vector; words not included in the GoogleNews model are ignored. Finally, we average all the word vectors of a user to serve as the feature of that node.
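The averaging step can be sketched as follows. To keep the example self-contained, `w2v` is a plain dict standing in for the GoogleNews Word2Vec model (in practice it would be, e.g., a gensim `KeyedVectors` loaded from the pre-trained binary):

```python
import numpy as np

# Sketch of the node-feature construction: average the 300-d
# Word2Vec vectors of all words in a user's reviews, skipping
# out-of-vocabulary words.
def user_feature(review_words, w2v, dim=300):
    vecs = [w2v[w] for w in review_words if w in w2v]  # OOV words ignored
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```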
- Node label: The label of one node represents the types of business (e.g., Coffee & Tea, Mexican Restaurants, Flowers & Gifts, Tours) that the user has been to. We collect all business types for each user and select the 100 most common types across all users in the graph as the labels.
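The multi-label construction above (and the analogous top-107 filtering for Amazon) can be sketched with a frequency count. The input `user_categories` (user → list of visited business types) is a hypothetical data structure for illustration:

```python
from collections import Counter

# Sketch of the label construction: keep the k most common business
# types (k=100 in the real pipeline) and encode each user as a
# binary indicator vector over those types.
def build_labels(user_categories, k=100):
    counts = Counter(c for cats in user_categories.values() for c in cats)
    keep = [c for c, _ in counts.most_common(k)]
    labels = {u: [int(c in cats) for c in keep]
              for u, cats in user_categories.items()}
    return labels, keep
```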
- Train/Val/Test split: The nodes are randomly split into Train/Val/Test sets in a 0.75/0.10/0.15 ratio. The training adjacency matrix is the subgraph induced by the training nodes.
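The split and the induced training adjacency (used identically, with different ratios, for all three datasets) can be sketched as below. The names `adj_full` and `split_graph` are illustrative, assuming the full graph is stored as a scipy CSR matrix:

```python
import numpy as np
import scipy.sparse as sp

# Sketch of the random node split plus the induced training subgraph.
def split_graph(adj_full, frac=(0.75, 0.10, 0.15), seed=0):
    n = adj_full.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(frac[0] * n), int(frac[1] * n)
    train = order[:n_tr]
    val = order[n_tr:n_tr + n_va]
    test = order[n_tr + n_va:]
    # Induced subgraph: keep only edges with BOTH endpoints in train.
    adj_train = adj_full[train][:, train]
    return train, val, test, adj_train
```

Restricting the adjacency to the induced subgraph ensures no validation/test edges leak into training.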
The Amazon dataset is processed from the bipartite user-item graph from Amazon. The task here is to predict the category of an Amazon product from the text of its reviews. Details for the Amazon dataset are as follows:
- Node: One node in the Amazon dataset represents one product listed on the Amazon website.
- Edge: We add an edge between two nodes A and B if the sets of buyers of product A and product B overlap.
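The buyer-overlap rule is a bipartite projection of the user-item graph: if B is the (user x product) purchase matrix, then BᵀB counts shared buyers, and any positive off-diagonal entry yields an edge. A minimal sketch (illustrative names, not the authors' code):

```python
import scipy.sparse as sp

# Sketch of the product-product graph via bipartite projection.
# B: sparse (num_users x num_products) 0/1 purchase matrix.
def product_graph(B):
    co = (B.T @ B).tocsr()    # co[i, j] = number of shared buyers
    co.setdiag(0)             # drop self-loops
    co.eliminate_zeros()
    co.data[:] = 1            # unweighted: any overlap gives an edge
    return co
```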
- Node feature: The node feature contains information from all the reviews of that product. The raw feature of each node is a sparse vector of character 4-gram counts from the reviews, where each non-zero element is the count of one 4-gram. Since the raw feature vectors are too long (length ~100k per node), we use SVD to reduce the dimensionality to 200 to serve as the GCN inputs.
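The reduction step can be sketched with truncated SVD. The exact SVD routine the authors used is not specified; scipy's sparse `svds` is one standard choice, and the small random matrix below stands in for the real (num_nodes x ~100k) 4-gram count matrix:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Sketch: reduce sparse high-dimensional count features to 200 dims.
X_raw = sp.random(400, 2000, density=0.01, format='csr', random_state=0)
u, s, vt = svds(X_raw, k=200)  # rank-200 truncated SVD
X = u * s                      # (num_nodes x 200) GCN input features
```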
- Node label: The label of one node represents the categories of the product (e.g., books, movies, shoes). The original dataset contains many "rare categories" that correspond to very few nodes. We eliminate those categories and keep only the 107 most frequently occurring ones. Note: as a result, a small percentage of nodes may have no label because they do not belong to any of the 107 categories. We leave them as they are, since this simply reflects the nature of the dataset.
- Train/Val/Test split: The nodes are randomly split into Train/Val/Test sets in a 0.85/0.05/0.10 ratio. The training adjacency matrix is the subgraph induced by the training nodes.
- Remark: in the GNN literature, we are aware of two other versions of the Amazon dataset -- one used by Cluster-GCN and the other included in the Open Graph Benchmark. Our version of Amazon is NOT exactly the same as those two: we obtained the raw data from another source and did the preprocessing on our own. You are free to try GraphSAINT on any version of Amazon. As an example, the OGB team has implemented GraphSAINT using PyTorch Geometric and generated the initial results here. However, we ourselves haven't tuned GraphSAINT on the OGB version of Amazon.
The Flickr dataset is processed from the publicly available NUS-WIDE Dataset and the Flickr Image Relationships data. Details for the Flickr dataset are as follows:
- Node: One node in the Flickr dataset represents one picture uploaded to the Flickr website. We select the pictures present in both the NUS-WIDE Dataset and the Flickr Image Relationships data as the nodes.
- Edge: One edge in the Flickr dataset represents one link in the Flickr Image Relationships data, which connects images from the same location, submitted to the same gallery, sharing common tags, taken by friends, etc. We ignore links whose endpoint pictures are not present in the Flickr dataset.
- Node feature: The node feature consists of the low-level image features provided by the NUS-WIDE Dataset.
- Node label: The label of one node represents the tag of the picture. We manually merge the 81 tags of the NUS-WIDE Dataset into 7 tags, including human, nature, animal, etc.
- Train/Val/Test split: The nodes are randomly split into Train/Val/Test sets in a 0.50/0.25/0.25 ratio. The training adjacency matrix is the subgraph induced by the training nodes.