Default column name behavior when there is no `.alias()` call #339

wenleix · 2022-11-25T04:10:14Z


Describe the bug
Not sure if it's by design, but I want to check the default column name behavior for column expressions (when there is no .alias() call).

To Reproduce
For trivial expressions like col("A"), col("B"), it will use the original column name if no .alias() is specified. This is expected (and kind of consistent with SQL, as in SELECT a, b):

>>> from daft import DataFrame, col
>>> df = DataFrame.from_pydict({
...     "A": [1, 2, 3, 4],
...     "B": [1.5, 2.5, 3.5, 4.5]
...  })
>>> df.select(col("A"), col("B")).show(2)
   A    B
0  1  1.5
1  2  2.5

For an expression like col("B") * 2, it will use "B" as the output column name if no .alias() is specified:

>>> df.select(col("A"), col("B") * 2).show(2)
   A    B
0  1  3.0
1  2  5.0

Not sure if this is by design (and whether this behavior will be kept in the future).

In particular, the output column name for col("A") + col("B") will be "A":

>>> df.select(col("B") * 2, col("A") + col("B")).show(2)
     B    A
0  3.0  2.5
1  5.0  4.5

Expected behavior
Not sure what the best strategy is -- should we explicitly require an .alias() call for column expressions like col("A") + col("B")? Some SQL engines assign column names like _col0, _col1 for things like SELECT b * 2, a + b.
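
For comparison, the positional naming scheme mentioned above can be sketched roughly as follows. This is only an illustration, not the behavior of any particular engine: the `assign_names` helper and its input shape (a list of `(expression_text, explicit_name_or_None)` pairs) are hypothetical, and numbering by position in the select list is an assumption.

```python
from typing import List, Optional, Tuple


def assign_names(selected: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Hypothetical helper: keep an explicit name, else fall back to _col<position>."""
    names = []
    for position, (_expr_text, explicit_name) in enumerate(selected):
        if explicit_name is not None:
            # Plain column references and aliased expressions keep their name.
            names.append(explicit_name)
        else:
            # Anonymous expressions get a generated, position-based name.
            names.append(f"_col{position}")
    return names


# SELECT b * 2, a + b
print(assign_names([("b * 2", None), ("a + b", None)]))
# SELECT a, a + b
print(assign_names([("a", "a"), ("a + b", None)]))
```

Under this scheme no expression silently inherits an input column's name, at the cost of generated names that are harder to refer to in later operations.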

Answered by jaychia


jaychia (Maintainer) · 2022-11-26T01:58:08Z

Hi @wenleix thanks for bringing this up!

Indeed - for naming, we currently default to using the left expression's name if no alias is specified. Specifically, for the example col("A") + col("B") we keep "A" as the name. We decided to keep this as the default behavior because oftentimes the semantic meaning of a column is still kept after performing some corrections on a column. For example:

df.select(df["year"] + 1)  # perform some corrections

In the above example, the "year" column still semantically means a year, and can thus continue to be referred to as "year" in downstream operations.

When this default case is not intended, users can provide external input using .alias(), assigning a new semantic meaning to the field. For example:

df.select((df["year"] - 2000).alias("years_after_2000"))
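
The naming rule described above can be modeled with a small toy expression tree. This is a sketch of the rule as stated in this thread, not Daft's actual implementation: the `Col`, `Lit`, `BinaryOp`, and `Alias` classes and the `output_name` function are hypothetical names introduced for illustration. The thread does not say what happens when the left operand is a literal, so the sketch simply raises in that case.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    """Reference to an existing column (hypothetical toy class)."""
    name: str


@dataclass(frozen=True)
class Lit:
    """A literal value such as 1 or 2000 (hypothetical toy class)."""
    value: object


@dataclass(frozen=True)
class BinaryOp:
    """A binary operation such as left + right (hypothetical toy class)."""
    op: str
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Alias:
    """An expression explicitly renamed via .alias() (hypothetical toy class)."""
    child: "Expr"
    name: str


Expr = Union[Col, Lit, BinaryOp, Alias]


def output_name(expr: Expr) -> str:
    """Resolve the output column name: explicit alias first, else the left operand's name."""
    if isinstance(expr, Alias):
        return expr.name
    if isinstance(expr, Col):
        return expr.name
    if isinstance(expr, BinaryOp):
        # Default described in the thread: inherit the left expression's name.
        return output_name(expr.left)
    # Assumption: a bare literal has no name in this toy model.
    raise ValueError("cannot derive a column name from a literal")


# col("A") + col("B") keeps the left operand's name
print(output_name(BinaryOp("+", Col("A"), Col("B"))))
# df["year"] + 1 is still called "year"
print(output_name(BinaryOp("+", Col("year"), Lit(1))))
# (df["year"] - 2000).alias("years_after_2000") takes the alias
print(output_name(Alias(BinaryOp("-", Col("year"), Lit(2000)), "years_after_2000")))
```

In this model the alias always wins over the inherited name, which mirrors how .alias() is used above to assign a new semantic meaning to the field.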

That being said, I just played around with PostgreSQL and it looks like they give the column an anonymous name "?column?" when an operation is executed. This seems reasonable as a default as well - i.e. we no longer know what the intended semantic meaning of a field is after any operation is executed on it. However, for Daft these columns are often re-used later, and having these unnamed columns makes that really difficult.

We like to adhere to having sensible defaults, and we deemed defaulting to the left expression's name as the most sensible default here. That being said, we're open to feedback as to why this might be unexpected behavior!


jaychia (Maintainer) · 2022-11-28T23:00:55Z

We're going to document this behavior better, issue tracking: #340

Thanks @wenleix!
