Add a Dataset to manage consistent user-item data processing #427

Merged
merged 43 commits into from
Jul 15, 2024
Commits
b682f63
Initial documentation for a dataset class
mdekstrand Jun 12, 2024
4c03fba
small doc typo
mdekstrand Jun 12, 2024
5c462b9
Add TODO notes on data
mdekstrand Jun 13, 2024
df6bd8b
improve data loading
mdekstrand Jun 17, 2024
dddb9a1
add pyprojroot test dep
mdekstrand Jun 17, 2024
45ae996
start roughing in matrix helper
mdekstrand Jun 17, 2024
8248d98
import ratings and start working on tests
mdekstrand Jun 17, 2024
b785bcd
add item number tests
mdekstrand Jun 17, 2024
fbfceaf
add list[int] as allowed type for id lookup
mdekstrand Jun 17, 2024
a810dea
smoke-tested dataset import
mdekstrand Jun 17, 2024
de64a69
exercise more lookup code paths
mdekstrand Jun 18, 2024
fac7206
fix type errors
mdekstrand Jul 11, 2024
6a4c466
data frame normalization
mdekstrand Jul 11, 2024
7e21be5
add interaction log tests
mdekstrand Jul 11, 2024
6871d50
play with some types
mdekstrand Jul 11, 2024
8dd435f
small doc tweaks for dataset
mdekstrand Jul 12, 2024
826d6fd
fix bad typing
mdekstrand Jul 12, 2024
401c16f
refactor tests into separate modules
mdekstrand Jul 12, 2024
5f6e022
add todo comments + dataset matrix test file
mdekstrand Jul 12, 2024
c41b35a
fix stray comment
mdekstrand Jul 12, 2024
a9792be
write more matrix tests
mdekstrand Jul 12, 2024
73323cc
limit identifier types
mdekstrand Jul 12, 2024
3d558c2
define allowable entity ID types
mdekstrand Jul 12, 2024
38d006e
make tables mutable dataclasses
mdekstrand Jul 12, 2024
3e961fc
implement interaction log
mdekstrand Jul 12, 2024
04a9f93
get matrix close to working
mdekstrand Jul 12, 2024
f260516
fix pandas matrix tests
mdekstrand Jul 14, 2024
3a41d8c
get scipy tests working
mdekstrand Jul 14, 2024
7377f6b
add scipy timestamp test
mdekstrand Jul 14, 2024
74d241c
tests for legacy scipy matrices
mdekstrand Jul 14, 2024
621f9ff
finish matrix tests
mdekstrand Jul 14, 2024
08622c5
start on vocab module
mdekstrand Jul 14, 2024
7e40d92
get single-item lookups working
mdekstrand Jul 14, 2024
b134d43
work on nums / terms
mdekstrand Jul 14, 2024
1c82238
getitem
mdekstrand Jul 14, 2024
b6a2fba
try to get it working
mdekstrand Jul 14, 2024
3cf8d26
get lookup tests working
mdekstrand Jul 14, 2024
1f50e15
fix bad vocab test
mdekstrand Jul 15, 2024
7f71b3e
bail on negative numbers
mdekstrand Jul 15, 2024
84c67dd
add id / term aliases
mdekstrand Jul 15, 2024
47b9db9
support adding terms
mdekstrand Jul 15, 2024
d6e6544
fix old scipy
mdekstrand Jul 15, 2024
e2c2ab1
don't lint tests
mdekstrand Jul 15, 2024
10 changes: 10 additions & 0 deletions docs/conf.py
@@ -81,6 +81,11 @@
# how do we want to set up documentation?
autodoc_default_options = {"members": True, "member-order": "bysource", "show-inheritance": True}
autodoc_typehints = "description"
autodoc_type_aliases = {
"Iterable": "Iterable",
"ArrayLike": "ArrayLike",
}

todo_include_todos = True

# Cross-linking and external references
@@ -98,6 +103,11 @@
"implicit": ("https://benfred.github.io/implicit/", None),
}

bibtex_bibfiles = ["lenskit.bib"]
jupyter_execute_notebooks = "off"

# -- external links

extlinks = {
"issue": ("https://github.com/lenskit/lkpy/issues/%s", "🐞 %s"),
"pr": ("https://github.com/lenskit/lkpy/pull/%s", "⛙ %s"),
78 changes: 75 additions & 3 deletions docs/data.rst
@@ -1,9 +1,81 @@
Data Utilities
--------------
Data Management
===============

.. module:: lenskit.data

These are general-purpose data processing utilities.
LensKit provides a unified data model for recommender systems data along with
classes and utility functions for working with it, described in this section of
the manual.


.. versionchanged:: 2024.1
    The new :class:`Dataset` class replaces the Pandas data frames
    that were passed to algorithms in the past. It also subsumes
    the old support for producing sparse matrices from rating frames.

.. _data-model:

Data Model and Key Concepts
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LensKit data model consists of **users**, **items**, and **interactions**,
with fields providing additional (optional) data about each of these entities.
The simplest valid LensKit data set is a list of user and item
identifiers indicating which items each user has interacted with. These may be
augmented with ratings, timestamps, or any other attributes.

Data can be read from a range of sources, but ultimately resolves to a
collection of tables (e.g. Pandas :class:`~pandas.DataFrame`) that record user,
item, and interaction data.
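
As an illustration of the data model described above, here is a minimal sketch of an interaction table built directly in Pandas. The ``user_id`` and ``item_id`` column names follow this document; the ``rating`` and ``timestamp`` column names and all of the values are assumptions made up for the example, not requirements stated here.

```python
import pandas as pd

# Simplest valid data: one row per interaction, identified by user and item.
interactions = pd.DataFrame(
    {
        "user_id": ["alice", "alice", "bob", "carol"],
        "item_id": [10, 20, 10, 30],
    }
)

# Optional fields augmenting each interaction (column names are assumed).
augmented = interactions.assign(
    rating=[4.0, 3.5, 5.0, 2.0],
    timestamp=pd.to_datetime(
        ["2024-06-12", "2024-06-13", "2024-06-17", "2024-07-15"]
    ),
)

# The items each user has interacted with.
items_by_user = augmented.groupby("user_id")["item_id"].apply(list).to_dict()
```

The first frame alone is enough to describe implicit-feedback data; the extra columns only add detail to each interaction.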

.. _data-identifiers:

Identifiers
-----------

Users and items have two identifiers:

* The *identifier* as presented in the original source table(s). It appears in
  LensKit data frames as ``user_id`` and ``item_id`` columns. Identifiers can
  be integers, strings, or byte arrays.
* The *number* assigned by the dataset handling code. This is a 0-based
  contiguous user or item number that is suitable for indexing into arrays or
  matrices, a common operation in recommendation models. In data frames, this
  appears as a ``user_num`` or ``item_num`` column. It is the only
  representation supported by NumPy and PyTorch array formats.
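
To show why 0-based contiguous numbers are convenient, the following sketch uses plain NumPy (not the LensKit API) to index a matrix and a per-item array with user and item numbers. The array names and values are hypothetical.

```python
import numpy as np

# Interactions already expressed with 0-based numbers (values are made up).
user_num = np.array([0, 0, 1, 2])
item_num = np.array([0, 1, 0, 2])
rating = np.array([4.0, 3.5, 5.0, 2.0])

n_users = int(user_num.max()) + 1
n_items = int(item_num.max()) + 1

# Numbers index directly into a dense user-item matrix.
matrix = np.zeros((n_users, n_items))
matrix[user_num, item_num] = rating

# They also index rows of a per-item array, such as hypothetical embeddings.
item_embeddings = np.arange(n_items * 2, dtype=float).reshape(n_items, 2)
rated_embeddings = item_embeddings[item_num]
```

Original identifiers such as strings could not be used this way without first being translated to numbers.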

User and item numbers are assigned based on sorted identifiers in the initial
data source, so reloading the same data set will yield the same numbers.
Loading a subset, however, is not guaranteed to result in the same numbers, as
the subset may be missing some users or items.
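
The effect of sorted assignment can be sketched with ``numpy.unique``, which returns sorted unique values. This stands in for the dataset handling code and is not its actual implementation; ``assign_numbers`` is a hypothetical helper.

```python
import numpy as np


def assign_numbers(ids):
    """Return (sorted unique IDs, 0-based number for each input ID)."""
    vocab, nums = np.unique(np.asarray(ids), return_inverse=True)
    return vocab, nums


# Full data: numbers follow the sorted order of the identifiers.
full_vocab, full_nums = assign_numbers(["u3", "u1", "u2", "u1"])

# Reloading the same identifiers yields the same numbers.
again_vocab, again_nums = assign_numbers(["u3", "u1", "u2", "u1"])

# A subset that lacks "u1" shifts the remaining users to lower numbers.
sub_vocab, sub_nums = assign_numbers(["u3", "u2"])

u3_in_full = int(np.searchsorted(full_vocab, "u3"))
u3_in_subset = int(np.searchsorted(sub_vocab, "u3"))
```

Here ``u3`` has a different number in the full data than in the subset, which is why numbers should not be carried across differently-loaded data sets.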

Methods that add additional users or items will assign numbers based on the
sorted identifiers that do not yet have numbers.

Identifiers and numbers can be mapped to each other with the user and item
*vocabularies* (:attr:`~Dataset.user_vocab` and :attr:`~Dataset.item_vocab`), as
well as convenience methods.
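
The two-way mapping can be sketched with a Pandas ``Index`` as a stand-in for a vocabulary. The helpers ``number_of``, ``id_of``, and ``add_ids`` are hypothetical names for this example and are not the LensKit vocabulary API. Placing new identifiers after the existing numbers is likewise an assumption about how "identifiers that do not yet have numbers" are handled.

```python
import pandas as pd

# Stand-in vocabulary: sorted identifiers, where position is the number.
vocab = pd.Index(sorted(["item-b", "item-a", "item-c"]))


def number_of(vocab, ids):
    """Map identifiers to numbers; -1 marks an unknown identifier."""
    return vocab.get_indexer(ids)


def id_of(vocab, nums):
    """Map numbers back to identifiers."""
    return vocab[nums].to_numpy()


def add_ids(vocab, new_ids):
    """Append not-yet-numbered identifiers in sorted order (assumed policy)."""
    fresh = sorted(set(new_ids) - set(vocab))
    return vocab.append(pd.Index(fresh))


nums = number_of(vocab, ["item-c", "item-a", "item-z"])
ids = id_of(vocab, [0, 2])

extended = add_ids(vocab, ["item-e", "item-a", "item-d"])
new_nums = number_of(extended, ["item-a", "item-d", "item-e"])
```

Under this policy, identifiers that already had numbers keep them, and only the new identifiers receive fresh numbers.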

.. autodata:: lenskit.data.vocab.EntityId

.. _dataset:

Dataset Abstraction
~~~~~~~~~~~~~~~~~~~

The :class:`Dataset` class is the standard LensKit interface to datasets
for training, evaluation, and related tasks. Trainable models and components
expect a dataset instance to be passed to
:meth:`~lenskit.algorithms.Recommender.fit`.

.. autoclass:: Dataset

User-Item Data Tables
~~~~~~~~~~~~~~~~~~~~~

.. module:: lenskit.data.tables

.. autoclass:: NumpyUserItemTable
.. autoclass:: TorchUserItemTable

Building Ratings Matrices
~~~~~~~~~~~~~~~~~~~~~~~~~
19 changes: 10 additions & 9 deletions docs/index.rst
@@ -35,15 +35,16 @@ Resources
releases/index

.. toctree::
:maxdepth: 2
:caption: Running Experiments

datasets
crossfold
batch
evaluation/index
documenting
parallel
:maxdepth: 2
:caption: Running Experiments

data
datasets
crossfold
batch
evaluation/index
documenting
parallel

.. toctree::
:maxdepth: 1
1 change: 0 additions & 1 deletion docs/util.rst
@@ -4,7 +4,6 @@ Utility Functions
These utility functions are useful for data processing.

.. toctree::
data
math

Miscellaneous
6 changes: 5 additions & 1 deletion lenskit/lenskit/data/__init__.py
@@ -6,6 +6,10 @@

from typing import Literal, TypeAlias

from .matrix import RatingMatrix, sparse_ratings # noqa: F401
from .vocab import EntityId, Vocabulary # noqa: F401, E402

FeedbackType: TypeAlias = Literal["explicit", "implicit"]
"Types of feedback supported."

from .dataset import Dataset, from_interactions_df # noqa: F401, E402
from .matrix import RatingMatrix, sparse_ratings # noqa: F401, E402