Spec: Clarify ns timestamps for ORC deserialization (#9007)
Helps #8657

In order for ORC types `timestamp` and `timestamp_instant` to be correctly converted to Iceberg `timestamp`, `timestamp_ns`, `timestamptz`, and `timestamptz_ns`, we need an ORC type attribute.
jacobmarble authored Nov 8, 2023
1 parent e8cf33d commit af132c7
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions format/spec.md
@@ -1000,10 +1000,10 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
| **`decimal(P,S)`** | `decimal` | | |
| **`date`** | `date` | | |
| **`time`** | `long` | `iceberg.long-type`=`TIME` | Stores microseconds from midnight. |
-| **`timestamp`** | `timestamp` | | Stores microseconds from 2015-01-01 00:00:00.000000. [1], [2] |
-| **`timestamptz`** | `timestamp_instant` | | Stores microseconds from 2015-01-01 00:00:00.000000 UTC. [1], [2] |
-| **`timestamp_ns`** | `timestamp` | | Stores nanoseconds from 2015-01-01 00:00:00.000000000. [1] |
-| **`timestamptz_ns`** | `timestamp_instant` | | Stores nanoseconds from 2015-01-01 00:00:00.000000000 UTC. [1] |
+| **`timestamp`** | `timestamp` | `iceberg.timestamp-unit`=`MICROS` | Stores microseconds from 2015-01-01 00:00:00.000000. [1], [2] |
+| **`timestamptz`** | `timestamp_instant` | `iceberg.timestamp-unit`=`MICROS` | Stores microseconds from 2015-01-01 00:00:00.000000 UTC. [1], [2] |
+| **`timestamp_ns`** | `timestamp` | `iceberg.timestamp-unit`=`NANOS` | Stores nanoseconds from 2015-01-01 00:00:00.000000000. [1] |
+| **`timestamptz_ns`** | `timestamp_instant` | `iceberg.timestamp-unit`=`NANOS` | Stores nanoseconds from 2015-01-01 00:00:00.000000000 UTC. [1] |
| **`string`** | `string` | | ORC `varchar` and `char` would also map to **`string`**. |
| **`uuid`** | `binary` | `iceberg.binary-type`=`UUID` | |
| **`fixed(L)`** | `binary` | `iceberg.binary-type`=`FIXED` & `iceberg.length`=`L` | The length would not be checked by the ORC reader and should be checked by the adapter. |
@@ -1015,7 +1015,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
Notes:

1. ORC's [TimestampColumnVector](https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.html) consists of a time field (milliseconds since epoch) and a nanos field (nanoseconds within the second). Hence the milliseconds within the second are reported twice; once in the time field and again in the nanos field. The read adapter should only use milliseconds within the second from one of these fields. The write adapter should also report milliseconds within the second twice; once in the time field and again in the nanos field. ORC writer is expected to correctly consider millis information from one of the fields. More details at https://issues.apache.org/jira/browse/ORC-546
-2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` **must** truncate nanoseconds to microseconds.
+2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` **must** truncate nanoseconds to microseconds. `iceberg.timestamp-unit` is assumed to be `MICROS` if not present.

One of the interesting challenges with this is how to map Iceberg’s schema evolution (id based) on to ORC’s (name based). In theory, we could use Iceberg’s column ids as the column and field names, but that would be inconvenient.
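
For illustration only, here is a minimal Python sketch of how a read adapter might apply the rows and notes changed in this commit. It is not taken from the Iceberg codebase: the helper names `resolve_iceberg_timestamp_type` and `combine_to_micros`, and the idea of passing ORC type attributes as a plain mapping, are assumptions made for this example. The sketch picks the Iceberg type from the ORC type plus `iceberg.timestamp-unit` (defaulting to `MICROS` when the attribute is absent, per note 2), and combines a `TimestampColumnVector`-style time/nanos pair so that the milliseconds within the second are counted only once (note 1).

```python
from typing import Mapping

# Hypothetical lookup mirroring the four table rows touched by this commit:
# (ORC type, iceberg.timestamp-unit) -> Iceberg type.
_ICEBERG_TIMESTAMP_TYPES = {
    ("timestamp", "MICROS"): "timestamp",
    ("timestamp", "NANOS"): "timestamp_ns",
    ("timestamp_instant", "MICROS"): "timestamptz",
    ("timestamp_instant", "NANOS"): "timestamptz_ns",
}


def resolve_iceberg_timestamp_type(orc_type: str, attributes: Mapping[str, str]) -> str:
    """Pick the Iceberg type for an ORC timestamp column (illustrative helper)."""
    # Note 2: the unit is assumed to be MICROS when the attribute is not present.
    unit = attributes.get("iceberg.timestamp-unit", "MICROS")
    try:
        return _ICEBERG_TIMESTAMP_TYPES[(orc_type, unit)]
    except KeyError:
        raise ValueError(f"unsupported ORC type/unit: {orc_type!r}, {unit!r}") from None


def combine_to_micros(time_millis: int, nanos: int) -> int:
    """Combine a (time, nanos) pair into microseconds since the epoch.

    Whole seconds come from the time field and the sub-second part comes
    from the nanos field only, so the milliseconds within the second are
    not double counted (note 1). Nanoseconds are truncated to
    microseconds (note 2).
    """
    seconds = time_millis // 1000  # floor division also handles pre-epoch values
    return seconds * 1_000_000 + nanos // 1_000
```

Under these assumptions, a file written before this change (no attribute) resolves to `timestamp` or `timestamptz`, while `iceberg.timestamp-unit`=`NANOS` selects the `_ns` variants.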
