Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Polars backend #22

Merged
merged 72 commits into from
Aug 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
72 commits
Select commit Hold shift + click to select a range
3f8b5be
add polars eager table and implement join
finn-rudolph Aug 14, 2024
764dc2a
fix bugs in the polars table
finn-rudolph Aug 15, 2024
b00d7d5
implement mutate for polars (without grouping)
finn-rudolph Aug 15, 2024
91c33e3
add export() verb
finn-rudolph Aug 15, 2024
81915fb
make some type hints explicit
finn-rudolph Aug 19, 2024
c2e9216
add CaseExpression translation for polars
finn-rudolph Aug 19, 2024
65c5dc3
allow switch in polars
finn-rudolph Aug 19, 2024
f65fcea
make a few tests work for polars
finn-rudolph Aug 19, 2024
5164cac
make sql tests work with polars only
finn-rudolph Aug 19, 2024
d260949
implement alias for polars
finn-rudolph Aug 19, 2024
b8e0cce
add filter for polars
finn-rudolph Aug 19, 2024
fa76e4d
add arrange for polars
finn-rudolph Aug 19, 2024
044cb81
add summarise for polars
finn-rudolph Aug 20, 2024
709f44c
do not check row order in polars group_by test
finn-rudolph Aug 20, 2024
b68b980
make window functions work for polars
finn-rudolph Aug 20, 2024
f5ee657
don't check row order after summarise / group_by
finn-rudolph Aug 21, 2024
39720b9
use polars for backend equivalence tests
finn-rudolph Aug 21, 2024
63f2a6b
implement table and column printing for polars
finn-rudolph Aug 21, 2024
9c012d0
implement AlignedExpressionEvaluator in polars
finn-rudolph Aug 21, 2024
abc87c7
make aligned expression tests work
finn-rudolph Aug 21, 2024
6475683
remove unnecessary function arguments
finn-rudolph Aug 21, 2024
9bbdf9b
add is_null() function on symbolic expression
finn-rudolph Aug 21, 2024
6fc055a
implement is_null() in SQL
finn-rudolph Aug 22, 2024
48440c7
remove expr arg to _translate_function in mssql
finn-rudolph Aug 22, 2024
f610e09
use correct datetime type in polars -> mssql
finn-rudolph Aug 22, 2024
4f4bed9
delete pandas-related files
finn-rudolph Aug 22, 2024
6b9ebb8
remove lazy / eager level in the table hierarchy
finn-rudolph Aug 22, 2024
554e518
remove example test
finn-rudolph Aug 22, 2024
30d01a4
replace λ by C
finn-rudolph Aug 22, 2024
f767332
make join arguments Literal
finn-rudolph Aug 22, 2024
c646666
make join col name collision resolving like polars
finn-rudolph Aug 22, 2024
9f0b6e4
require alias before self-join
finn-rudolph Aug 22, 2024
2653992
add common aggregation functions for polars
finn-rudolph Aug 23, 2024
4598a80
make join always append a suffix to the right cols
finn-rudolph Aug 23, 2024
7857448
add tests for mutliple self-join in polars
finn-rudolph Aug 23, 2024
9f67ecc
allow partition_by= argument for window / agg fns
finn-rudolph Aug 23, 2024
c93b5ab
add fill_null for polars
finn-rudolph Aug 23, 2024
1a2afe3
move datetime operations into separate namespace
finn-rudolph Aug 23, 2024
cb6ee48
simplify namespacing on symbolic expressions
finn-rudolph Aug 23, 2024
8df16f3
auto-convert all polars numeric types to pdt types
finn-rudolph Aug 23, 2024
81ee112
add Date type
finn-rudolph Aug 23, 2024
06d08e4
add Duration type
finn-rudolph Aug 26, 2024
22f8ed7
implement Duration to unit ops in polars
finn-rudolph Aug 26, 2024
980e97a
extend Sub operator for Date / Datetime (polars)
finn-rudolph Aug 26, 2024
cd2fb11
implement filter= arg for agg functions in polars
finn-rudolph Aug 26, 2024
9c48137
translate literals to polars expressions
finn-rudolph Aug 26, 2024
f3c802c
add tests for datetime and filter argument
finn-rudolph Aug 26, 2024
14d715d
implement isin() for polars
finn-rudolph Aug 26, 2024
91e0f58
add test for partition_by argument
finn-rudolph Aug 26, 2024
512e17f
implement dense_rank for polars
finn-rudolph Aug 27, 2024
457bd0e
make rank() work in polars
finn-rudolph Aug 27, 2024
c9b2762
save non-working min / max solution for row_number
finn-rudolph Aug 27, 2024
e890ae5
make row_number work with null_last
finn-rudolph Aug 27, 2024
50408a5
remove patterns from test data
finn-rudolph Aug 27, 2024
0a4050a
implement string ops in polars
finn-rudolph Aug 27, 2024
f28e7d6
add .str namespace for string ops
finn-rudolph Aug 27, 2024
12effee
add count function in polars
finn-rudolph Aug 27, 2024
ce3ab88
prefix all datetime ops with `dt` for convention
finn-rudolph Aug 27, 2024
1202980
add count, greatest, least for polars
finn-rudolph Aug 27, 2024
d00d499
make shift arrange= work on polars
finn-rudolph Aug 27, 2024
7e32ae4
fix mistake in arrange test (non-unique ordering)
finn-rudolph Aug 27, 2024
d6a3d1c
cast to pl.Float64 when comparing dfs in tests
finn-rudolph Aug 27, 2024
090169d
add slice_head in polars
finn-rudolph Aug 27, 2024
2490f87
make partition_by= for window / agg work in SQL
finn-rudolph Aug 28, 2024
7c8cb83
remove str_join operation
finn-rudolph Aug 28, 2024
eb7dd44
simplify order_by= implementation in polars
finn-rudolph Aug 28, 2024
b624e0b
allow addition of durations in polars
finn-rudolph Aug 28, 2024
cb819dc
implement str.slice function
finn-rudolph Aug 29, 2024
f0fd4c1
add is_not_null()
finn-rudolph Aug 29, 2024
0447b57
fix bug in polars rank / dense_rank
finn-rudolph Aug 29, 2024
88220bd
add stronger test, fix bugs, remove unused tests
finn-rudolph Aug 29, 2024
7aaa3de
make DuckDB backend work
finn-rudolph Aug 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22,013 changes: 14,431 additions & 7,582 deletions pixi.lock

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions pixi.toml
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ numpy = ">=1.23.1"
pandas = ">=1.4.3"
SQLAlchemy = ">=1.4.27"
pyarrow = ">=11.0.0"
polars = ">=1.6.0"

[feature.dev.dependencies]
ruff = ">=0.5.6"
Expand Down Expand Up @@ -73,5 +74,4 @@ py312 = ["py312", "dev", "duckdb", "postgres", "mssql"]
py39ibm = ["py39", "dev", "duckdb", "postgres", "mssql", "ibm-db"]
py312ibm = ["py312", "dev", "duckdb", "postgres", "mssql", "ibm-db"]
docs = ["docs"]
release = { features=["release"], no-default-feature=true }

release = { features = ["release"], no-default-feature = true }
4 changes: 2 additions & 2 deletions src/pydiverse/transform/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
from pydiverse.transform.core import functions
from pydiverse.transform.core.alignment import aligned, eval_aligned
from pydiverse.transform.core.dispatchers import verb
from pydiverse.transform.core.expressions.lambda_getter import λ
from pydiverse.transform.core.expressions.lambda_getter import C
from pydiverse.transform.core.table import Table

__all__ = [
Expand All @@ -12,5 +12,5 @@
"eval_aligned",
"functions",
"verb",
"λ",
"C",
]
12 changes: 12 additions & 0 deletions src/pydiverse/transform/core/dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,14 @@ class DateTime(DType):
name = "datetime"


class Date(DType):
name = "date"


class Duration(DType):
name = "duration"


class Template(DType):
name = None

Expand Down Expand Up @@ -167,8 +175,12 @@ def dtype_from_string(t: str) -> DType:
return String(const=is_const, vararg=is_vararg)
if base_type == "bool":
return Bool(const=is_const, vararg=is_vararg)
if base_type == "date":
return Date(const=is_const, vararg=is_vararg)
if base_type == "datetime":
return DateTime(const=is_const, vararg=is_vararg)
if base_type == "duration":
return Duration(const=is_const, vararg=is_vararg)
if base_type == "none":
return NoneDType(const=is_const, vararg=is_vararg)

Expand Down
8 changes: 4 additions & 4 deletions src/pydiverse/transform/core/expressions/expressions.py
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ class LambdaColumn(BaseExpression):
The following fails because `table.a` gets referenced before it gets created.
table >> mutate(a = table.x) >> mutate(b = table.a)
Instead you can use a lambda column to achieve this:
table >> mutate(a = table.x) >> mutate(b = λ.a)
table >> mutate(a = table.x) >> mutate(b = C.a)
"""

__slots__ = "name"
Expand All @@ -115,10 +115,10 @@ def __init__(self, name: str):
self.name = name

def __repr__(self):
return f"<λ.{self.name}>"
return f"<C.{self.name}>"

def _expr_repr(self) -> str:
return f"λ.{self.name}"
return f"C.{self.name}"

def __eq__(self, other):
if isinstance(other, self.__class__):
Expand All @@ -129,7 +129,7 @@ def __ne__(self, other):
return not self.__eq__(other)

def __hash__(self):
return hash(("λ", self.name))
return hash(("C", self.name))


class LiteralColumn(BaseExpression, Generic[T]):
Expand Down
22 changes: 8 additions & 14 deletions src/pydiverse/transform/core/expressions/lambda_getter.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,22 +3,16 @@
from pydiverse.transform.core.expressions import LambdaColumn
from pydiverse.transform.core.expressions.symbolic_expressions import SymbolicExpression

__all__ = ["C"]

class LambdaColumnGetter:
"""
An instance of this object can be used to instantiate a LambdaColumn.
"""

def __getattr__(self, item):
if item.startswith("__"):
raise AttributeError(
f"'{type(self).__name__}' object has no attribute '{item}'"
)
return SymbolicExpression(LambdaColumn(item))
class MC(type):
def __getattr__(cls, name: str) -> SymbolicExpression:
return SymbolicExpression(LambdaColumn(name))

def __getitem__(self, item):
return SymbolicExpression(LambdaColumn(item))
def __getitem__(cls, name: str) -> SymbolicExpression:
return SymbolicExpression(LambdaColumn(name))


# Global instance of LambdaColumnGetter.
λ = LambdaColumnGetter()
class C(metaclass=MC):
pass
Original file line number Diff line number Diff line change
Expand Up @@ -34,12 +34,13 @@ def __getattr__(self, item) -> SymbolAttribute:
f"Invalid attribute {item}. Attributes can't begin and end with an"
" underscore."
)

return SymbolAttribute(item, self)

def __getitem__(self, item):
return SymbolicExpression(FunctionCall("__getitem__", self, item))

def case(self, *cases: tuple[Any, Any], default: Any = None):
def case(self, *cases: tuple[Any, Any], default: Any = None) -> SymbolicExpression:
case_expression = CaseExpression(
switching_on=self,
cases=cases,
Expand Down
12 changes: 4 additions & 8 deletions src/pydiverse/transform/core/expressions/translator.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,10 +70,7 @@ def _translate(self, expr, **kwargs):

if isinstance(expr, FunctionCall):
operator = self.operator_registry.get_operator(expr.name)

# Mutate function call arguments using operator
expr_args, expr_kwargs = operator.mutate_args(expr.args, expr.kwargs)
expr = FunctionCall(expr.name, *expr_args, **expr_kwargs)
expr = FunctionCall(expr.name, *expr.args, **expr.kwargs)

op_args, op_kwargs, context_kwargs = self._translate_function_arguments(
expr, operator, **kwargs
Expand All @@ -88,7 +85,7 @@ def _translate(self, expr, **kwargs):
)

return self._translate_function(
expr, implementation, op_args, context_kwargs, **kwargs
implementation, op_args, context_kwargs, **kwargs
)

if isinstance(expr, CaseExpression):
Expand Down Expand Up @@ -118,15 +115,14 @@ def _translate(self, expr, **kwargs):
f" {expr}."
)

def _translate_col(self, expr: Column, **kwargs) -> T:
def _translate_col(self, col: Column, **kwargs) -> T:
raise NotImplementedError

def _translate_literal_col(self, expr: LiteralColumn, **kwargs) -> T:
def _translate_literal_col(self, col: LiteralColumn, **kwargs) -> T:
raise NotImplementedError

def _translate_function(
self,
expr: FunctionCall,
implementation: registry.TypedOperatorImpl,
op_args: list[T],
context_kwargs: dict[str, Any],
Expand Down
14 changes: 11 additions & 3 deletions src/pydiverse/transform/core/functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,15 +18,23 @@ def _sym_f_call(name, *args, **kwargs) -> SymbolicExpression[FunctionCall]:
return SymbolicExpression(FunctionCall(name, *args, **kwargs))


def count(expr: SymbolicExpression = None):
def count(expr: SymbolicExpression | None = None):
if expr is None:
return _sym_f_call("count")
else:
return _sym_f_call("count", expr)


def row_number(*, arrange: list):
return _sym_f_call("row_number", arrange=arrange)
def row_number(*, arrange: list, partition_by: list | None = None):
return _sym_f_call("row_number", arrange=arrange, partition_by=partition_by)


def rank(*, arrange: list, partition_by: list | None = None):
return _sym_f_call("rank", arrange=arrange, partition_by=partition_by)


def dense_rank(*, arrange: list, partition_by: list | None = None):
return _sym_f_call("dense_rank", arrange=arrange, partition_by=partition_by)


def case(*cases: tuple[Any, Any], default: Any = None):
Expand Down
6 changes: 3 additions & 3 deletions src/pydiverse/transform/core/registry.py
Original file line number Diff line number Diff line change
Expand Up @@ -195,9 +195,9 @@ class OperatorRegistry:
def __init__(self, name, super_registry=None):
self.name = name
self.super_registry = super_registry
self.registered_ops = set() # type: set[Operator]
self.implementations = dict() # type: dict[str, OperatorImplementationStore]
self.check_super = dict() # type: dict[str, bool]
self.registered_ops: set[Operator] = set()
self.implementations: dict[str, OperatorImplementationStore] = dict()
self.check_super: dict[str, bool] = dict()

def register_op(self, operator: Operator, check_super=True):
"""
Expand Down
15 changes: 9 additions & 6 deletions src/pydiverse/transform/core/table.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
LambdaColumn,
SymbolicExpression,
)
from pydiverse.transform.core.verbs import export


class Table(Generic[ImplT]):
Expand Down Expand Up @@ -84,12 +85,10 @@ def __copy__(self):
return self.__class__(impl_copy)

def __str__(self):
from pydiverse.transform.core.verbs import collect

try:
return (
f"Table: {self._impl.name}, backend: {type(self._impl).__name__}\n"
f"{self >> collect()}"
f"{self >> export()}"
)
except Exception as e:
return (
Expand All @@ -99,15 +98,13 @@ def __str__(self):
)

def _repr_html_(self) -> str | None:
from pydiverse.transform.core.verbs import collect

html = (
f"Table <code>{self._impl.name}</code> using"
f" <code>{type(self._impl).__name__}</code> backend:</br>"
)
try:
# TODO: For lazy backend only show preview (eg. take first 20 rows)
html += (self >> collect())._repr_html_()
html += (self >> export())._repr_html_()
except Exception as e:
html += (
"</br><pre>Failed to collect table due to an exception:\n"
Expand All @@ -117,3 +114,9 @@ def _repr_html_(self) -> str | None:

def _repr_pretty_(self, p, cycle):
p.text(str(self) if not cycle else "...")

def cols(self) -> list[Column]:
return [
self._impl.cols[uuid].as_column(name, self._impl)
for (name, uuid) in self._impl.selected_cols()
]
30 changes: 20 additions & 10 deletions src/pydiverse/transform/core/table_impl.py
Original file line number Diff line number Diff line change
Expand Up @@ -77,13 +77,13 @@ def __init__(
self.compiler = self.ExpressionCompiler(self)
self.lambda_translator = self.LambdaTranslator(self)

self.selects = ordered_set() # type: ordered_set[str]
self.named_cols = bidict() # type: bidict[str: uuid.UUID]
self.available_cols = set() # type: set[uuid.UUID]
self.cols = {} # type: dict[uuid.UUID: ColumnMetaData]
self.selects: ordered_set[str] = ordered_set() # subset of named_cols
self.named_cols: bidict[str, uuid.UUID] = bidict()
self.available_cols: set[uuid.UUID] = set()
self.cols: dict[uuid.UUID, ColumnMetaData] = dict()

self.grouped_by = ordered_set() # type: ordered_set[Column]
self.intrinsic_grouped_by = ordered_set() # type: ordered_set[Column]
self.grouped_by: ordered_set[Column] = ordered_set()
self.intrinsic_grouped_by: ordered_set[Column] = ordered_set()

# Init Values
for name, col in columns.items():
Expand Down Expand Up @@ -185,7 +185,7 @@ def select(self, *args): ...

def mutate(self, **kwargs): ...

def join(self, right, on, how, *, validate=None): ...
def join(self, right, on, how, *, validate="m:m"): ...

def filter(self, *args): ...

Expand All @@ -199,6 +199,8 @@ def summarise(self, **kwargs): ...

def slice_head(self, n: int, offset: int): ...

def export(self): ...

#### Symbolic Operators ####

@classmethod
Expand Down Expand Up @@ -237,6 +239,10 @@ def _translate_literal(self, expr, **kwargs):
return TypedValue(literal, dtypes.String(const=True))
if isinstance(expr, datetime.datetime):
return TypedValue(literal, dtypes.DateTime(const=True))
if isinstance(expr, datetime.date):
return TypedValue(literal, dtypes.Date(const=True))
if isinstance(expr, datetime.timedelta):
return TypedValue(literal, dtypes.Duration(const=True))

if expr is None:
return TypedValue(literal, dtypes.NoneDType(const=True))
Expand Down Expand Up @@ -329,6 +335,10 @@ def _translate_literal(self, expr, **kwargs):
return TypedValue(expr, dtypes.String(const=True))
if isinstance(expr, datetime.datetime):
return TypedValue(expr, dtypes.DateTime(const=True))
if isinstance(expr, datetime.date):
return TypedValue(expr, dtypes.Date(const=True))
if isinstance(expr, datetime.timedelta):
return TypedValue(expr, dtypes.Duration(const=True))

if expr is None:
return TypedValue(expr, dtypes.NoneDType(const=True))
Expand Down Expand Up @@ -427,7 +437,7 @@ class ColumnMetaData:

@classmethod
def from_expr(cls, uuid, expr, table: AbstractTableImpl, **kwargs):
v = table.compiler.translate(expr, **kwargs)
v: TypedValue = table.compiler.translate(expr, **kwargs)
return cls(
uuid=uuid,
expr=expr,
Expand Down Expand Up @@ -469,7 +479,7 @@ def _nulls_last(_):
def _add(lhs, rhs):
return lhs + rhs

@op.extension(ops.StringAdd)
@op.extension(ops.StrAdd)
def _str_add(lhs, rhs):
return lhs + rhs

Expand All @@ -480,7 +490,7 @@ def _str_add(lhs, rhs):
def _radd(rhs, lhs):
return lhs + rhs

@op.extension(ops.StringRAdd)
@op.extension(ops.StrRAdd)
def _str_radd(lhs, rhs):
return lhs + rhs

Expand Down
Loading
Loading