Get started with top level documentation and implement little features to make example code nicer. #34

Merged
merged 8 commits into from
Dec 7, 2024
13 changes: 13 additions & 0 deletions docs/source/best_practices.md
@@ -0,0 +1,13 @@
# Best Practices for describing tabular data transformations

Many data science and machine learning projects work with tabular data. Popular ways of describing transformations
of such tabular data in Python are pandas, polars, and SQL. This section provides best practices for working
with tabular data in Python.

* [beware the flatfile & embrace working with entities](/examples/best_practices_entities)
* [start sql, finish polars](/examples/best_practices_sql_polars)

```{toctree}
/examples/best_practices_entities
/examples/best_practices_sql_polars
```
2 changes: 2 additions & 0 deletions docs/source/changelog.md
@@ -13,6 +13,8 @@
## 0.2.0 (2024-08-31)

- add polars backend
- removed pandas backend (conversion to polars needed on ingest and export)
* eventually, the syntax should look like this with hidden Polars conversion: `pdt.Table(df) >> ... >> export(Pandas())`
- add Date and Duration type
- string / datetime operations now have their separate namespace (.str / .dt)
- add partition_by=, arrange= and filter= arguments for window / aggregation functions (filter does not work on SQL yet)
86 changes: 86 additions & 0 deletions docs/source/database_testing.md
@@ -0,0 +1,86 @@
# Database testing

Relational databases are quite effective for analyzing medium-sized tabular data. You can leave the data in the
database and just describe the transformation in Python. All that needs to be exchanged between Python and the database
is the SQL string, which is executed within the database as a `CREATE TABLE ... AS SELECT ...` statement. The database
can then execute the query in an optimal and parallelized way.

In practice, a relational database is already running somewhere and all you need is a connection URL and access
credentials. See [Table Backends](table_backends.md) for a list of currently supported databases.

The following example shows how to launch a postgres database in a container with docker-compose and how to work with it
using pydiverse.transform.

You can put the following example transform code in a file called `run_transform.py`:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *
import sqlalchemy as sa

# initialize database=pydiverse and schema=transform
base_engine = sa.create_engine(
    "postgresql://sa:[email protected]:6543/postgres",
    execution_options={"isolation_level": "AUTOCOMMIT"}
)
with base_engine.connect() as conn:
    exists = len(conn.execute(sa.text("SELECT FROM pg_database WHERE datname = 'pydiverse'")).fetchall()) > 0
    if not exists:
        conn.execute(sa.text("CREATE DATABASE pydiverse"))
    conn.commit()
engine = sa.create_engine("postgresql://sa:[email protected]:6543/pydiverse")
with engine.connect() as conn:
    conn.execute(sa.text("CREATE SCHEMA IF NOT EXISTS transform"))
    conn.execute(sa.text("DROP TABLE IF EXISTS transform.tbl1"))
    conn.execute(sa.text("CREATE TABLE transform.tbl1 AS SELECT 'a' as a, 1 as b"))
    conn.execute(sa.text("DROP TABLE IF EXISTS transform.tbl2"))
    conn.execute(sa.text("CREATE TABLE transform.tbl2 AS SELECT 'a' as a, 1.1 as c"))
    conn.commit()

# process tables
tbl1 = pdt.Table("tbl1", SqlAlchemy(engine, schema="transform"))
tbl2 = pdt.Table("tbl2", SqlAlchemy(engine, schema="transform"))
tbl1 >> left_join(tbl2, tbl1.a == tbl2.a) >> show_query() >> show()
```

If you don't have a postgres database at hand, you can start one with the following `docker-compose.yaml` file:

```yaml
version: "3.9"
services:
  postgres:
    image: postgres
    environment:
      POSTGRES_USER: sa
      POSTGRES_PASSWORD: Pydiverse23
    ports:
      - "6543:5432"
```

Run `docker-compose up` in the directory of your `docker-compose.yaml` and then execute the example script as follows,
using a shell like `bash` and a Python environment that includes `pydiverse-transform` and
`psycopg2`/`psycopg2-binary`:

```bash
python run_transform.py
```

Finally, you may connect to your localhost postgres database `pydiverse` and
look at the tables in the `transform` schema.

If you don't have a SQL UI at hand, you can use the `psql` command-line tool inside the Docker container.
Check the `NAMES` column of the `docker ps` output to find the container name. If your postgres container is called
`example_postgres_1`, you can look at the output tables like this:

```bash
docker exec example_postgres_1 psql --username=sa --dbname=pydiverse -c 'select * from transform.tbl1;'
```

Or more interactively:

```bash
docker exec -t -i example_postgres_1 bash
psql --username=sa --dbname=pydiverse
\dt transform.*
select * from transform.tbl2;
```
20 changes: 20 additions & 0 deletions docs/source/examples.md
@@ -0,0 +1,20 @@
# Examples

Some examples of how to use pydiverse.transform:

* [Quickstart examples](/quickstart)
* [Joining and Tables as Column namespaces](/examples/joining)
* [Aggregation functions](/examples/aggregations)
* [Window functions](/examples/window_functions)
* [Working with a real database](/database_testing)
* [DuckDB, Polars, and Parquet](/examples/duckdb_polars_parquet)
* [Best practices / beware the flatfile & embrace working with entities](/examples/best_practices_entities)

```{toctree}
/quickstart
/examples/joining
/examples/aggregations
/examples/window_functions
/database_testing
/examples/duckdb_polars_parquet
```
18 changes: 18 additions & 0 deletions docs/source/examples/aggregations.md
@@ -0,0 +1,18 @@
# Aggregation functions

Pydiverse.transform offers grouped and ungrouped aggregations with the `summarize()` verb:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=[1, 1, 2], b=[4, 5, 6]))

tbl1 >> summarize(sum_a=tbl1.a.sum(), sum_b=tbl1.b.sum()) >> show()
tbl1 >> group_by(tbl1.a) >> summarize(sum_b=tbl1.b.sum()) >> show()
```

Typical aggregation functions are `sum()`, `mean()`, `count()`, `min()`, `max()`, `any()`, and `all()`.
These functions can be used in the `summarize()` verb.
They can also be used as [window functions](/examples/window_functions) in the `mutate()` verb in case aggregated
values shall be projected back to the rows of the original table expression.
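
As a sketch of this window-style usage (assuming, as suggested by the changelog, that aggregation functions accept a
`partition_by=` argument and also work under `group_by()` / `ungroup()`), the per-group sum can be attached to every
row:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=[1, 1, 2], b=[4, 5, 6]))

# explicit notation: the per-group sum of b is projected back onto every row
tbl1 >> mutate(sum_b_per_a=tbl1.b.sum(partition_by=tbl1.a)) >> show()

# equivalent group_by / ungroup notation
tbl1 >> group_by(tbl1.a) >> mutate(sum_b_per_a=tbl1.b.sum()) >> ungroup() >> show()
```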
24 changes: 24 additions & 0 deletions docs/source/examples/best_practices_entities.md
@@ -0,0 +1,24 @@
# Best Practices: Beware the flatfile & embrace working with entities

In DataFrame libraries, joining different tables is often either cumbersome or slow. As a consequence, many data
pipelines bring their main pieces of information together in one big table called a flatfile. While this might be nice
for quick exploration of the data, it causes several problems for long-term maintenance and for the speed of adding new
features:
1. The number of columns grows very large and becomes hard to get an overview of for users who don't know all the
prefixes and suffixes by heart.
2. Associated information with a 1:n relationship is either duplicated (wasting space), written to an array column
(reducing flexibility for further joins), or it simply becomes prohibitively hard to add features at a certain
granularity.
3. In case a table is historized, storing rows for each version of a data field, the table size grows quadratically
with the number of columns.

The alternative is to keep column groups with similar subject-matter meaning or similar data sources together in
separate tables called entities. Especially when data transformation code is created programmatically with a nice
syntax, working with typical groups of entities can be made quite easy, with code in the background joining the
underlying tables.

Flatfiles are often created before feature engineering. Due to the large number of features (columns), it then becomes
necessary to build automatic tooling that executes the code for each feature in the correct order and avoids wasteful
execution. When working at entity granularity (column groups of similar origin), however, it is manageable to wire all
feature engineering computations manually. The code showing how the different computation steps / entities build on
each other is itself very valuable: it makes tracking down problems during debugging much easier and gives newcomers a
chance to step through the code.
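
To illustrate the entity-based style, here is a minimal sketch with hypothetical `customers` and `accounts` entity
tables; the `with_accounts` helper verb is not part of pydiverse.transform but defined in user space, hiding the join
in the background:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

# hypothetical entities: one table per subject matter instead of one flatfile
customers = pdt.Table(dict(customer_id=[1, 2], age=[31, 58]), name="customers")
accounts = pdt.Table(dict(customer_id=[1, 1, 2], balance=[100.0, 250.0, 80.0]), name="accounts")

@verb
def with_accounts(tbl):
    # background join: callers keep working at customer granularity and can still
    # reference account columns in their expressions
    return tbl >> left_join(accounts, tbl.customer_id == accounts.customer_id)

customers >> with_accounts() >> mutate(is_large_account=accounts.balance > 200) >> show()
```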
26 changes: 26 additions & 0 deletions docs/source/examples/best_practices_sql_polars.md
@@ -0,0 +1,26 @@
# Best Practices: start sql, finish polars

At the beginning of a data pipeline, the largest amount of data is typically touched with rather simple
operations: data is combined, encodings are converted/harmonized, simple aggregations and computations are performed,
and data is heavily filtered. These operations lend themselves very well to using a powerful database and converting
the transformations to SQL `CREATE TABLE ... AS SELECT ...` statements. This way, the data stays within the database
and the communication-heavy operations can be performed efficiently (i.e. parallelized) right where the data is stored.

Towards the end of the pipeline, the vast open-source ecosystem of training, evaluation, and visualization libraries
is needed, which is best interfaced with classical Polars / Pandas DataFrames in memory.

In the middle, during feature engineering, there is still a large amount of logic that is, with some exceptions,
simple enough for typical SQL expressiveness. Thus it is very helpful to be able to jump between SQL and Polars for
performance reasons while staying, for the most part, within the same pydiverse.transform syntax for describing
transformations.

When moving code to production, it is often the case that prediction calls are made with much less data than during
training. Thus it might not be worth setting up a sophisticated database technology in that case. Pydiverse.transform
allows taking code written for SQL execution during training and using the exact same code for execution on Polars in
production. In the long run, we also want to be able to generate ONNX graphs from transform code to make long-term
reliable deployments even easier.
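
The following is a minimal sketch of this switch, reusing only verbs shown in the other examples; the table and column
names are made up for illustration:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl = pdt.Table(dict(x=[1, 2, 3], y=[4, 5, 6]), name="features")

# training: let DuckDB (standing in for a real database) execute the query as SQL
tbl >> collect(DuckDb()) >> mutate(ratio=tbl.x / tbl.y) >> show_query() >> show()

# production: the exact same expression executed on Polars in memory
tbl >> collect(Polars()) >> mutate(ratio=tbl.x / tbl.y) >> show()
```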

The aim of pydiverse.transform is not feature completeness but rather versatility, ease of use, and very predictable
and reliable behavior. Thus it should always integrate nicely with other ways of writing data transformations. Together
with [pydiverse.pipedag](https://pydiversepipedag.readthedocs.io/en/latest/), this interoperability becomes even
easier.
24 changes: 24 additions & 0 deletions docs/source/examples/duckdb_polars_parquet.md
@@ -0,0 +1,24 @@
# DuckDB, Polars, and Parquet

Pydiverse.transform can swiftly switch between DuckDB- and Polars-based execution:

```python
from pydiverse.transform.extended import *
import pydiverse.transform as pdt

tbl = pdt.Table(dict(x=[1, 2, 3], y=[4, 5, 6]), name="A")
tbl2 = pdt.Table(dict(x=[2, 3], z=["b", "c"]), name="B") >> collect(DuckDb())

out = (
    tbl >> collect(DuckDb()) >> left_join(tbl2, tbl.x == tbl2.x) >> show_query()
    >> collect(Polars()) >> mutate(z=tbl.x + tbl.y) >> show()
)

df1 = out >> export(Polars())
print(type(df1))

df2 = out >> export(Polars(lazy=False))
print(type(df2))
```

In the future, it is also intended to allow both DuckDB and Polars backends to read and write Parquet files.
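
Until then, a possible workaround (a sketch continuing the example above, going through the eager Polars export
together with plain Polars I/O) looks like this:

```python
import polars as pl

# write the result to a Parquet file via the eager Polars export
(out >> export(Polars(lazy=False))).write_parquet("result.parquet")

# read it back and wrap it as a transform table again
tbl3 = pdt.Table(pl.read_parquet("result.parquet"), name="from_parquet")
tbl3 >> show()
```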
88 changes: 88 additions & 0 deletions docs/source/examples/joining.md
@@ -0,0 +1,88 @@
# Joining and Tables as Column namespaces

If you want to combine data from two tables, you need to describe which rows of one table should be combined with which
rows of the other table. This process is called joining. In case of a left_join, all rows of the table entering the join
will appear at least once in the output. The join condition defines exactly which rows of both tables are combined.
Columns coming from the other table will be NULL for all rows where no match could be found. In case of an inner_join,
only rows that have a match in both tables will be in the output.

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=["a", "b", "c"], b=[1, 2, 3]))
tbl2 = pdt.Table(dict(a=["a", "b", "b", "d"], c=[1.1, 2.2, 2.3, 4.4]), name="tbl2")

tbl1 >> left_join(tbl2, tbl1.a == tbl2.a) >> show()
tbl1 >> inner_join(tbl2, tbl1.a == tbl2.a) >> show()
```

left_join result:
```text
Table ?, backend: PolarsImpl
shape: (4, 4)
┌─────┬─────┬────────┬────────┐
│ a ┆ b ┆ a_tbl2 ┆ c_tbl2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═════╪═════╪════════╪════════╡
│ a ┆ 1 ┆ a ┆ 1.1 │
│ b ┆ 2 ┆ b ┆ 2.2 │
│ b ┆ 2 ┆ b ┆ 2.3 │
│ c ┆ 3 ┆ null ┆ null │
└─────┴─────┴────────┴────────┘
```

inner_join result:
```text
Table ?, backend: PolarsImpl
shape: (3, 4)
┌─────┬─────┬────────┬────────┐
│ a ┆ b ┆ a_tbl2 ┆ c_tbl2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═════╪═════╪════════╪════════╡
│ a ┆ 1 ┆ a ┆ 1.1 │
│ b ┆ 2 ┆ b ┆ 2.2 │
│ b ┆ 2 ┆ b ┆ 2.3 │
└─────┴─────┴────────┴────────┘
```

For DataFrame libraries, it is quite common that a join combines all columns of both tables, so that the user can then
pick the columns of interest for further expressions. In SQL, the act of joining does not by itself bring any new
columns into the output; it only adds the columns of the joined tables to the namespace of columns that can be used in
expressions of the `mutate` and `summarize` verbs.

In pydiverse.transform, the empty `select()` verb can be used to hide all columns of a table. The hidden columns can
still be used in further expressions:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=["a", "b", "c"], b=[1, 2, 3]))
tbl2 = pdt.Table(dict(a=["a", "b", "b", "d"], c=[1.1, 2.2, 2.3, 4.4]), name="tbl2")

(
    tbl1
    >> left_join(tbl2 >> select(), tbl1.a == tbl2.a) >> show()
    >> mutate(d=tbl1.b + tbl2.c) >> show()
)
```

*dplyr* also has a verb called `transmute`, which is very similar to `mutate` but removes/hides all columns that were
not specified in the call. This can easily be implemented in pydiverse.transform in user space:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

@verb
def transmute(tbl, **kwargs):
    # the empty select() is used to hide all columns; they can still be used in subsequent mutate statements
    return tbl >> select() >> mutate(**kwargs)

tbl1 = pdt.Table(dict(a=["a", "b", "c"], b=[1, 2, 3]))

tbl1 >> transmute(a=tbl1.a, b_sqr=tbl1.b * tbl1.b) >> show()
```
25 changes: 25 additions & 0 deletions docs/source/examples/window_functions.md
@@ -0,0 +1,25 @@
# Window functions

Pydiverse.transform offers window functions with the `mutate()` verb.
Window functions are functions that operate on a set of rows related to the current row.
They can be computed independently per group, can use the order of rows, and can be restricted to a filtered
subset of the table. The simplest window function is `shift(n)`, which shifts a column by `n` rows. Defining an
ordering is very important for this operation.

There are two notations that define the grouping and ordering in different ways.
The first explicitly passes the `partition_by`, `arrange`, and `filter` arguments to the window function.
The second makes use of the existing verbs `group_by()` and `arrange()`. Here, the additional verb `ungroup()` signals
that no `summarize()` will follow; instead, the `group_by()` arguments are used as `partition_by` and the `arrange()`
arguments as `arrange` parameters of the window functions.

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *
from polars.testing import assert_frame_equal

tbl1 = pdt.Table(dict(a=[1, 1, 2, 2, 2, 3], b=[4, 5, 8, 7, 6, 9]))

out1 = tbl1 >> mutate(b_shift=tbl1.b.shift(1, partition_by=tbl1.a, arrange=-tbl1.b, filter=tbl1.b < 8)) >> show()
out2 = tbl1 >> group_by(tbl1.a) >> arrange(-tbl1.b) >> mutate(b_shift=tbl1.b.shift(1, filter=tbl1.b < 8)) >> ungroup() >> show()
assert_frame_equal(out1 >> arrange(-tbl1.b) >> export(Polars()), out2 >> export(Polars()))
```