Feature Request: Add how="align_left" to pl.concat() for faster alignment #20637

Open
drumtorben opened this issue Jan 9, 2025 · 1 comment · May be fixed by #20644
Labels
enhancement New feature or an improvement of an existing feature

Comments

drumtorben commented Jan 9, 2025

Description

Currently, in pl.concat() with how="align", we can combine multiple DataFrames by auto-determining the common key columns. According to the documentation, this operation always performs a full outer join, which can be relatively slow for large datasets.

I noticed that pl.align_frames() accepts how="left" for alignment, which speeds up the process significantly.

Proposal

It would be useful to introduce a how="align_left" option in pl.concat() that performs alignment by always using the keys from the first DataFrame for left joins. This would be a faster alternative to the current how="align" behavior.

Benefits

  • Performance Improvement: Aligning frames with a left join is generally faster than performing a full outer join.
  • Flexibility: Provides users with more control over how frames are aligned when combining them.
  • Consistency: Leverages existing functionality in pl.align_frames() to enhance pl.concat().

Suggested Implementation

  • Add a new how="align_left" value to pl.concat().
  • Under the hood, use pl.align_frames() with how="left" to handle the alignment.

Example

import polars as pl

df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [1, 2], "c": [5, 6]})

# Current behavior
result = pl.concat([df1, df2], how="align")  # Full outer join on common columns

# Proposed behavior
result = pl.concat([df1, df2], how="align_left")  # Left join on keys from df1

Additional Notes

If this feature is feasible, updating the documentation and examples to clarify the difference between how="align" and how="align_left" would be essential.

alexander-beedie (Collaborator) commented Jan 9, 2025

FYI: concat doesn't actually use align_frames internally as it needs to create a single result frame (whereas align_frames creates as many output frames as there are input frames). However, supporting more generic alignment in concat does look straightforward so I'll take a look shortly ;)
