Get started with top level documentation and implement little features to make example code nicer. #34

Merged
merged 8 commits into from
Dec 7, 2024
13 changes: 13 additions & 0 deletions docs/source/best_practices.md
@@ -0,0 +1,13 @@
# Best Practices for describing tabular data transformations

Many data science and machine learning projects work with tabular data. Popular ways of describing transformations
of such tabular data in Python are pandas, polars, and SQL. This section provides best practices for working
with tabular data in Python.

* [beware the flatfile & embrace working with entities](/examples/best_practices_entities)
* [start sql, finish polars](/examples/best_practices_sql_polars)

```{toctree}
/examples/best_practices_entities
/examples/best_practices_sql_polars
```
2 changes: 2 additions & 0 deletions docs/source/changelog.md
@@ -13,6 +13,8 @@
## 0.2.0 (2024-08-31)

- add polars backend
- removed pandas backend (conversion to polars needed on ingest and export)
* eventually, the syntax should look like this with hidden Polars conversion: `pdt.Table(df) >> ... >> export(Pandas())`
- add Date and Duration type
- string / datetime operations now have their separate namespace (.str / .dt)
- add partition_by=, arrange= and filter= arguments for window / aggregation functions (filter does not work on SQL yet)
86 changes: 86 additions & 0 deletions docs/source/database_testing.md
@@ -0,0 +1,86 @@
# Database testing

Relational databases are quite effective for analyzing medium-sized tabular data. You can leave the data in the
database and just describe the transformation in Python. All that needs to be exchanged between Python and the database
is the SQL string, which is executed within the database as a `CREATE TABLE ... AS SELECT ...` statement. The database
can then execute the query in an optimal and parallelized way.

In practice, a relational database is already running somewhere and all you need is a connection URL and access
credentials. See [Table Backends](table_backends.md) for a list of currently supported databases.

The following example shows how to launch a postgres database in a container with docker-compose and how to work with it
using pydiverse.transform.

You can put the following example transform code in a file called `run_transform.py`:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *
import sqlalchemy as sa

# initialize database=pydiverse and schema=transform
base_engine = sa.create_engine(
    "postgresql://sa:[email protected]:6543/postgres",
    execution_options={"isolation_level": "AUTOCOMMIT"}
)
with base_engine.connect() as conn:
    exists = len(conn.execute(sa.text("SELECT FROM pg_database WHERE datname = 'pydiverse'")).fetchall()) > 0
    if not exists:
        conn.execute(sa.text("CREATE DATABASE pydiverse"))
    conn.commit()
engine = sa.create_engine("postgresql://sa:[email protected]:6543/pydiverse")
with engine.connect() as conn:
    conn.execute(sa.text("CREATE SCHEMA IF NOT EXISTS transform"))
    conn.execute(sa.text("DROP TABLE IF EXISTS transform.tbl1"))
    conn.execute(sa.text("CREATE TABLE transform.tbl1 AS SELECT 'a' as a, 1 as b"))
    conn.execute(sa.text("DROP TABLE IF EXISTS transform.tbl2"))
    conn.execute(sa.text("CREATE TABLE transform.tbl2 AS SELECT 'a' as a, 1.1 as c"))
    conn.commit()

# process tables
tbl1 = pdt.Table("tbl1", SqlAlchemy(engine, schema="transform"))
tbl2 = pdt.Table("tbl2", SqlAlchemy(engine, schema="transform"))
tbl1 >> left_join(tbl2, tbl1.a == tbl2.a) >> show_query() >> show()
```

If you don't have a postgres database at hand, you can start one with the following `docker-compose.yaml` file:

```yaml
version: "3.9"
services:
  postgres:
    image: postgres
    environment:
      POSTGRES_USER: sa
      POSTGRES_PASSWORD: Pydiverse23
    ports:
      - "6543:5432"
```

Run `docker-compose up` in the directory of your `docker-compose.yaml` and then execute the example script as follows,
using a shell like `bash` and a Python environment that includes `pydiverse-transform` and
`psycopg2`/`psycopg2-binary`:

```bash
python run_transform.py
```

Finally, you may connect to your localhost postgres database `pydiverse` and
look at the tables in the `transform` schema.

If you don't have a SQL UI at hand, you can use the `psql` command-line tool inside the Docker container.
Check the `NAMES` column of the `docker ps` output to find the container name. If your postgres container is called
`example_postgres_1`, you can look at the output tables like this:

```bash
docker exec example_postgres_1 psql --username=sa --dbname=pydiverse -c 'select * from transform.tbl1;'
```

Or more interactively:

```bash
docker exec -t -i example_postgres_1 bash
psql --username=sa --dbname=pydiverse
\dt transform.*
select * from transform.tbl2;
```
20 changes: 20 additions & 0 deletions docs/source/examples.md
@@ -0,0 +1,20 @@
# Examples

Some examples of how to use pydiverse.transform:

* [Quickstart examples](/quickstart)
* [Joining and Tables as Column namespaces](/examples/joining)
* [Aggregation functions](/examples/aggregations)
* [Window functions](/examples/window_functions)
* [Working with a real database](/database_testing)
* [DuckDB, Polars, and Parquet](/examples/duckdb_polars_parquet)
* [Best practices / beware the flatfile & embrace working with entities](/examples/best_practices_entities)

```{toctree}
/quickstart
/examples/joining
/examples/aggregations
/examples/window_functions
/database_testing
/examples/duckdb_polars_parquet
```
18 changes: 18 additions & 0 deletions docs/source/examples/aggregations.md
@@ -0,0 +1,18 @@
# Aggregation functions

Pydiverse.transform offers grouped and ungrouped aggregations with the `summarize()` verb:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=[1, 1, 2], b=[4, 5, 6]))

tbl1 >> summarize(sum_a=tbl1.a.sum(), sum_b=tbl1.b.sum()) >> show()
tbl1 >> group_by(tbl1.a) >> summarize(sum_b=tbl1.b.sum()) >> show()
```

Typical aggregation functions are `sum()`, `mean()`, `count()`, `min()`, `max()`, `any()`, and `all()`.
These functions can be used in the `summarize()` verb.
They can also be used as [window functions](/examples/window_functions) in the `mutate()` verb in case aggregated
values shall be projected back to the rows of the original table expression.
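
As a sketch of this window-style usage (assuming, as suggested by the changelog, that aggregation functions accept a
`partition_by=` argument and also work under `group_by()` / `ungroup()`), the per-group sum can be attached to every
row:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=[1, 1, 2], b=[4, 5, 6]))

# explicit notation: the per-group sum of b is projected back onto every row
tbl1 >> mutate(sum_b_per_a=tbl1.b.sum(partition_by=tbl1.a)) >> show()

# equivalent group_by / ungroup notation
tbl1 >> group_by(tbl1.a) >> mutate(sum_b_per_a=tbl1.b.sum()) >> ungroup() >> show()
```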
24 changes: 24 additions & 0 deletions docs/source/examples/best_practices_entities.md
@@ -0,0 +1,24 @@
# Best Practices: Beware the flatfile & embrace working with entities

In DataFrame libraries, joining different tables is often either cumbersome or slow. As a consequence, many data
pipelines bring their main pieces of information together in one big table called a flatfile. While this might be nice
for quick exploration of the data, it causes several problems for long-term maintenance and for the speed of adding new
features:
1. The number of columns grows very large and becomes hard to get an overview of for users who don't know all the
prefixes and suffixes by heart.
2. Associated information with a 1:n relationship is either duplicated (wasting space), written to an array column
(reducing flexibility for further joins), or it simply becomes prohibitively hard to add features at a certain
granularity.
3. In case a table is historized, storing rows for each version of a data field, the table size grows quadratically
with the number of columns.

The alternative is to keep column groups with similar subject-matter meaning or similar data sources together in
separate tables called entities. Especially when data transformation code is created programmatically with a nice
syntax, working with typical groups of entities can be made quite easy, with code in the background joining the
underlying tables.

Flatfiles are often created before feature engineering. Due to the large number of features (columns), it then becomes
necessary to build automatic tooling that executes the code for each feature in the correct order and avoids wasteful
execution. When working at entity granularity (column groups of similar origin), however, it is manageable to wire all
feature engineering computations manually. The code showing how the different computation steps / entities build on
each other is itself very valuable: it makes tracking down problems during debugging much easier and gives newcomers a
chance to step through the code.
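
To illustrate the entity-based style, here is a minimal sketch with hypothetical `customers` and `accounts` entity
tables; the `with_accounts` helper verb is not part of pydiverse.transform but defined in user space, hiding the join
in the background:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

# hypothetical entities: one table per subject matter instead of one flatfile
customers = pdt.Table(dict(customer_id=[1, 2], age=[31, 58]), name="customers")
accounts = pdt.Table(dict(customer_id=[1, 1, 2], balance=[100.0, 250.0, 80.0]), name="accounts")

@verb
def with_accounts(tbl):
    # background join: callers keep working at customer granularity and can still
    # reference account columns in their expressions
    return tbl >> left_join(accounts, tbl.customer_id == accounts.customer_id)

customers >> with_accounts() >> mutate(is_large_account=accounts.balance > 200) >> show()
```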
26 changes: 26 additions & 0 deletions docs/source/examples/best_practices_sql_polars.md
@@ -0,0 +1,26 @@
# Best Practices: start sql, finish polars

At the beginning of a data pipeline, the largest amount of data is typically touched with rather simple
operations: data is combined, encodings are converted/harmonized, simple aggregations and computations are performed,
and data is heavily filtered. These operations lend themselves very well to using a powerful database and converting
the transformations to SQL `CREATE TABLE ... AS SELECT ...` statements. This way, the data stays within the database
and the communication-heavy operations can be performed efficiently (i.e. parallelized) right where the data is stored.

Towards the end of the pipeline, the vast open-source ecosystem of training, evaluation, and visualization libraries
is needed, which is best interfaced with classical Polars / Pandas DataFrames in memory.

In the middle, during feature engineering, there is still a large amount of logic that is, with some exceptions,
simple enough for typical SQL expressiveness. Thus it is very helpful to be able to jump between SQL and Polars for
performance reasons while staying, for the most part, within the same pydiverse.transform syntax for describing
transformations.

When moving code to production, it is often the case that prediction calls are made with much less data than during
training. Thus it might not be worth setting up a sophisticated database technology in that case. Pydiverse.transform
allows taking code written for SQL execution during training and using the exact same code for execution on Polars in
production. In the long run, we also want to be able to generate ONNX graphs from transform code to make long-term
reliable deployments even easier.
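
The following is a minimal sketch of this switch, reusing only verbs shown in the other examples; the table and column
names are made up for illustration:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl = pdt.Table(dict(x=[1, 2, 3], y=[4, 5, 6]), name="features")

# training: let DuckDB (standing in for a real database) execute the query as SQL
tbl >> collect(DuckDb()) >> mutate(ratio=tbl.x / tbl.y) >> show_query() >> show()

# production: the exact same expression executed on Polars in memory
tbl >> collect(Polars()) >> mutate(ratio=tbl.x / tbl.y) >> show()
```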

The aim of pydiverse.transform is not feature completeness but rather versatility, ease of use, and very predictable
and reliable behavior. Thus it should always integrate nicely with other ways of writing data transformations. Together
with [pydiverse.pipedag](https://pydiversepipedag.readthedocs.io/en/latest/), this interoperability becomes even
easier.
24 changes: 24 additions & 0 deletions docs/source/examples/duckdb_polars_parquet.md
@@ -0,0 +1,24 @@
# DuckDB, Polars, and Parquet

Pydiverse.transform can swiftly switch between DuckDB- and Polars-based execution:

```python
from pydiverse.transform.extended import *
import pydiverse.transform as pdt

tbl = pdt.Table(dict(x=[1, 2, 3], y=[4, 5, 6]), name="A")
tbl2 = pdt.Table(dict(x=[2, 3], z=["b", "c"]), name="B") >> collect(DuckDb())

out = (
    tbl >> collect(DuckDb()) >> left_join(tbl2, tbl.x == tbl2.x) >> show_query()
    >> collect(Polars()) >> mutate(z=tbl.x + tbl.y) >> show()
)

df1 = out >> export(Polars())
print(type(df1))

df2 = out >> export(Polars(lazy=False))
print(type(df2))
```

In the future, it is also intended to allow both DuckDB and Polars backends to read and write Parquet files.
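
Until then, a possible workaround (a sketch continuing the example above, going through the eager Polars export
together with plain Polars I/O) looks like this:

```python
import polars as pl

# write the result to a Parquet file via the eager Polars export
(out >> export(Polars(lazy=False))).write_parquet("result.parquet")

# read it back and wrap it as a transform table again
tbl3 = pdt.Table(pl.read_parquet("result.parquet"), name="from_parquet")
tbl3 >> show()
```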
88 changes: 88 additions & 0 deletions docs/source/examples/joining.md
@@ -0,0 +1,88 @@
# Joining and Tables as Column namespaces

If you want to combine data from two tables, you need to describe which rows of one table should be combined with which
rows of the other table. This process is called joining. In case of a left_join, all rows of the table entering the join
will appear at least once in the output. The join condition defines exactly which rows of both tables are combined.
Columns coming from the other table will be NULL for all rows where no match could be found. In case of an inner_join,
only rows that have a match in both tables will be in the output.

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=["a", "b", "c"], b=[1, 2, 3]))
tbl2 = pdt.Table(dict(a=["a", "b", "b", "d"], c=[1.1, 2.2, 2.3, 4.4]), name="tbl2")

tbl1 >> left_join(tbl2, tbl1.a == tbl2.a) >> show()
tbl1 >> inner_join(tbl2, tbl1.a == tbl2.a) >> show()
```

left_join result:
```text
Table ?, backend: PolarsImpl
shape: (4, 4)
┌─────┬─────┬────────┬────────┐
│ a ┆ b ┆ a_tbl2 ┆ c_tbl2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═════╪═════╪════════╪════════╡
│ a ┆ 1 ┆ a ┆ 1.1 │
│ b ┆ 2 ┆ b ┆ 2.2 │
│ b ┆ 2 ┆ b ┆ 2.3 │
│ c ┆ 3 ┆ null ┆ null │
└─────┴─────┴────────┴────────┘
```

inner_join result:
```text
Table ?, backend: PolarsImpl
shape: (3, 4)
┌─────┬─────┬────────┬────────┐
│ a ┆ b ┆ a_tbl2 ┆ c_tbl2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═════╪═════╪════════╪════════╡
│ a ┆ 1 ┆ a ┆ 1.1 │
│ b ┆ 2 ┆ b ┆ 2.2 │
│ b ┆ 2 ┆ b ┆ 2.3 │
└─────┴─────┴────────┴────────┘
```

For DataFrame libraries, it is quite common that a join combines all columns of both tables, so that the user can then
pick the columns of interest for further expressions. In SQL, the act of joining does not by itself bring any new
columns into the output; it only adds the columns of the joined tables to the namespace of columns that can be used in
expressions of the `mutate` and `summarize` verbs.

In pydiverse.transform, the empty `select()` verb can be used to hide all columns of a table. The hidden columns can
still be used in further expressions:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

tbl1 = pdt.Table(dict(a=["a", "b", "c"], b=[1, 2, 3]))
tbl2 = pdt.Table(dict(a=["a", "b", "b", "d"], c=[1.1, 2.2, 2.3, 4.4]), name="tbl2")

(
    tbl1
    >> left_join(tbl2 >> select(), tbl1.a == tbl2.a) >> show()
    >> mutate(d=tbl1.b + tbl2.c) >> show()
)
```

*dplyr* also has a verb called `transmute`, which is very similar to `mutate` but removes/hides all columns that were
not specified in the call. This can easily be implemented in pydiverse.transform in user space:

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *

@verb
def transmute(tbl, **kwargs):
    # the empty select() is used to hide all columns; they can still be used in subsequent mutate statements
    return tbl >> select() >> mutate(**kwargs)

tbl1 = pdt.Table(dict(a=["a", "b", "c"], b=[1, 2, 3]))

tbl1 >> transmute(a=tbl1.a, b_sqr=tbl1.b * tbl1.b) >> show()
```
25 changes: 25 additions & 0 deletions docs/source/examples/window_functions.md
@@ -0,0 +1,25 @@
# Window functions

Pydiverse.transform offers window functions with the `mutate()` verb.
Window functions are functions that operate on a set of rows related to the current row.
They can be computed independently per group, can use the order of rows, and can be restricted to a filtered
subset of the table. The simplest window function is `shift(n)`, which shifts a column by `n` rows. Defining an
ordering is very important for this operation.

There are two notations that define the grouping and ordering in different ways.
The first explicitly passes the `partition_by`, `arrange`, and `filter` arguments to the window function.
The second makes use of the existing verbs `group_by()` and `arrange()`. Here, the additional verb `ungroup()` signals
that no `summarize()` will follow; instead, the `group_by()` arguments are used as `partition_by` and the `arrange()`
arguments as `arrange` parameters of the window functions.

```python
import pydiverse.transform as pdt
from pydiverse.transform.extended import *
from polars.testing import assert_frame_equal

tbl1 = pdt.Table(dict(a=[1, 1, 2, 2, 2, 3], b=[4, 5, 8, 7, 6, 9]))

out1 = tbl1 >> mutate(b_shift=tbl1.b.shift(1, partition_by=tbl1.a, arrange=-tbl1.b, filter=tbl1.b < 8)) >> show()
out2 = tbl1 >> group_by(tbl1.a) >> arrange(-tbl1.b) >> mutate(b_shift=tbl1.b.shift(1, filter=tbl1.b < 8)) >> ungroup() >> show()
assert_frame_equal(out1 >> arrange(-tbl1.b) >> export(Polars()), out2 >> export(Polars()))
```