Update README.md #9

Open: wants to merge 1 commit into base: master
109 changes: 62 additions & 47 deletions README.md
@@ -1,74 +1,89 @@
# Public Datasets For Recommender Systems
# Public Datasets for Recommender Systems

This is a repository of a topic-centric public data sources in high quality for Recommender Systems (RS). They are collected and tidied from Stack Overflow, articles, recommender sites and academic experiments. Most of the datasets presented here are free, having open sorce linceses, however, some are not and you need to ask permission to use or cite the authors' work.
This repository contains a collection of high-quality, topic-centric public data sources for Recommender Systems (RS). They have been collected and curated from Stack Overflow, articles, recommender sites, and academic experiments. Most datasets presented here are free and have open-source licenses. However, some require permission to use or citation of the authors' work.

> In addition, this repository contains some pre-processed datasets with treatment for academic experiments.
> This repository also includes some pre-processed datasets tailored for academic experiments.

## Link and datasets descriptions
## Link and Datasets Descriptions

### Book
- [Book Crossing](http://www2.informatik.uni-freiburg.de/~cziegler/BX/):: The BookCrossing (BX) dataset was collected by Cai-Nicolas in a 4-week crawl (August / September 2004) from the Book-Crossing community

| Dataset | Description |
| --- | --- |
| [Book Crossing](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) | The BookCrossing (BX) dataset was collected by Cai-Nicolas in a 4-week crawl (August/September 2004) from the Book-Crossing community. |

### Dating
- [Dating Agency](http://www.occamslab.com/petricek/data/):: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as dumped on April 4, 2006.
| Dataset | Description |
| --- | --- |
| [Dating Agency](http://www.occamslab.com/petricek/data/) | This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as of April 4, 2006. |

### E-commerce
- [Amazon](http://jmcauley.ucsd.edu/data/amazon/):: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014
- [Retailrocket recommender system dataset](https://www.kaggle.com/retailrocket/ecommerce-dataset):: The dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.сsv) and a file, which describes category tree (category_tree.сsv). The data has been collected from a real-world ecommerce website.
| Dataset | Description |
| --- | --- |
| [Amazon](http://jmcauley.ucsd.edu/data/amazon/) | This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. |
| [Retailrocket Recommender System Dataset](https://www.kaggle.com/retailrocket/ecommerce-dataset) | The dataset consists of three files: behavior data (events.csv), item properties (item_properties.csv), and a category tree (category_tree.csv). The data has been collected from a real-world e-commerce website. |

### Music
- [Amazon Music](http://jmcauley.ucsd.edu/data/amazon/):: This digital music dataset contains reviews and metadata from Amazon
- [Yahoo Music](https://webscope.sandbox.yahoo.com/catalog.php?datatype=r):: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.
- [LastFM (Implicit)](https://grouplens.org/datasets/hetrec-2011/):: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system.
- [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/):: The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
| Dataset | Description |
| --- | --- |
| [Amazon Music](http://jmcauley.ucsd.edu/data/amazon/) | This digital music dataset contains reviews and metadata from Amazon. |
| [Yahoo Music](https://webscope.sandbox.yahoo.com/catalog.php?datatype=r) | This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. |
| [LastFM (Implicit)](https://grouplens.org/datasets/hetrec-2011/) | Contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system. |
| [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/) | A freely-available collection of audio features and metadata for a million contemporary popular music tracks. |

### Movies
- [MovieLens](https://grouplens.org/datasets/movielens/):: GroupLens Research has collected and made available rating datasets from their movie web site
- [Yahoo Movies](https://webscope.sandbox.yahoo.com/catalog.php?datatype=r):: This dataset contains ratings for songs collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Music services.
- [CiaoDVD](https://drive.google.com/file/d/1w1FuVSQC9nqxcK5xj0Aw5Oxc1qV7d09A/view?usp=sharing):: CiaoDVD is a dataset crawled from the entire category of DVDs from the dvd.ciao.co.uk website in December, 2013
- [FilmTrust](https://drive.google.com/file/d/1ohQ9oo8aaR7aWlpe56hXx66x-bwXxB56/view?usp=sharing):: FilmTrust is a small dataset crawled from the entire FilmTrust website in June, 2011
- [Netflix](http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a):: This is the official data set used in the Netflix Prize competition.
### Games
| Dataset | Description |
| --- | --- |
| [MovieLens](https://grouplens.org/datasets/movielens/) | GroupLens Research has collected and made available rating datasets from their movie website. |
| [Yahoo Movies](https://webscope.sandbox.yahoo.com/catalog.php?datatype=r) | Contains ratings for songs collected from two different sources: ratings supplied by users during normal interaction with Yahoo! Music services. |
| [CiaoDVD](https://drive.google.com/file/d/1w1FuVSQC9nqxcK5xj0Aw5Oxc1qV7d09A/view?usp=sharing) | Crawled from the entire category of DVDs from the dvd.ciao.co.uk website in December 2013. |
| [FilmTrust](https://drive.google.com/file/d/1ohQ9oo8aaR7aWlpe56hXx66x-bwXxB56/view?usp=sharing) | A small dataset crawled from the entire FilmTrust website in June 2011. |
| [Netflix](http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a) | The official dataset used in the Netflix Prize competition. |

- [Steam Video Games](https://www.kaggle.com/tamber/steam-video-games/data):: This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are 'purchase' and 'play'. The value indicates the degree to which the behavior was performed - in the case of 'purchase' the value is always 1, and in the case of 'play' the value represents the number of hours the user has played the game.
### Games
| Dataset | Description |
| --- | --- |
| [Steam Video Games](https://www.kaggle.com/tamber/steam-video-games/data) | A list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are 'purchase' and 'play'. |

### Jokes
- [Jester](http://www.ieor.berkeley.edu/~goldberg/jester-data/):: This Joke dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users

| Dataset | Description |
| --- | --- |
| [Jester](http://www.ieor.berkeley.edu/~goldberg/jester-data/) | Contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users. |

### Food
- [Chicago Entree](http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data):: This dataset contains a record of user interactions with the Entree Chicago restaurant recommendation system.

| Dataset | Description |
| --- | --- |
| [Chicago Entree](http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data) | Contains a record of user interactions with the Entree Chicago restaurant recommendation system. |

### Anime
- [Anime Recommendations Database](https://www.kaggle.com/CooperUnion/anime-recommendations-database):: This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.
| Dataset | Description |
| --- | --- |
| [Anime Recommendations Database](https://www.kaggle.com/CooperUnion/anime-recommendations-database) | Contains user preference data from 73,516 users on 12,294 anime, including user ratings. |

### Android Applications
| Dataset | Description |
| --- | --- |
| [Myket Android Application Install Dataset](https://github.com/erfanloghmani/myket-android-application-market-dataset) | Contains 694,121 application install interactions from 10,000 anonymous users and 7,988 Android applications. |

### Other Datasets
| Source | Description |
| --- | --- |
| [GroupLens Datasets](https://grouplens.org/datasets) | Collection of various recommender system datasets from GroupLens Research. |
| [LibRec Datasets](https://www.librec.net/datasets.html) | Datasets provided by the LibRec project. |
| [Yahoo Research](https://webscope.sandbox.yahoo.com/catalog.php?datatype=r) | Collection of datasets used for research by Yahoo. |
| [Datasets for Machine Learning](https://gist.github.com/entaroadun/1653794) | A curated list of datasets useful for machine learning applications. |
| [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/) | Large network datasets collected by Stanford. |

- [Myket Android Application Install Dataset](https://github.com/erfanloghmani/myket-android-application-market-dataset):: This dataset contains 694,121 application install interactions from 10,000 anonymous users and 7,988 Anroid applications.

### Other dataset

You can find more datasets in:

- GroupLens Datasets [link](https://grouplens.org/datasets)
- LibRec Datasets [link](https://www.librec.net/datasets.html)
- Yahoo Research [link](https://webscope.sandbox.yahoo.com/catalog.php?datatype=r)
- Datasets for Machine Learning [link](https://gist.github.com/entaroadun/1653794)
- Stanford Large Network Dataset Collection [link](https://snap.stanford.edu/data/)

## Usage and License

Before using these data sets, please review their README files or sites for the usage licenses, acknowledgments and other details.
Before using these datasets, please review their README files or websites for usage licenses, acknowledgments, and other details.

`Note` : If you have difficulties in downloading any of these datasets please contact me. I have backup of all datasets.
> **Note**: If you have difficulties downloading any of these datasets, please contact me. I have backups of all datasets.

## Recommender Tools

- [Case Recommender](https://github.com/caserec/CaseRecommender):: Python.
- [MyMediaLite](http://www.mymedialite.net/):: C#.
| Tool | Language |
| --- | --- |
| [Case Recommender](https://github.com/caserec/CaseRecommender) | Python |
| [MyMediaLite](http://www.mymedialite.net/) | C# |

## Contributors

Arthur Fortes da Costa {fortes [dot] arthur [at] gmail [dot] com} [Editor]


- Arthur Fortes da Costa {fortes [dot] arthur [at] gmail [dot] com} [Editor]