[fs] expose hl.fast_stat and hl.hadoop_fast_stat to users #13886

Closed
@danking wants to merge 8 commits from the reduce-class-a-followup branch


@danking danking commented Oct 23, 2023

CHANGELOG: Introduce hl.fs.fast_stat and hl.hadoop_fast_stat which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of hl.hadoop_stat and hl.fs.stat should consider switching.

This PR extends #13885 into the public API.

Dan King added 3 commits October 23, 2023 10:45
CHANGELOG: Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary.

Inspired by @jigold's file system improvement campaign, I worked to avoid "list"
operations. I anticipate this reduces both flakiness (tracked in hail-is#13351) and cost in
Azure.

I enforced aiotools.fs terminology on hail.fs and Scala:

1. `FileStatus`. Metadata about a blob or file. It does not know if a directory exists at this path.

2. `FileListEntry`. Metadata from a list operation. It knows if a directory exists at this path.

Variable names were updated to reflect this distinction:

1. `fileStatus` / `fileStatuses`

2. `fle` / `fles` / `fileListEntry` / `fileListEntries`, respectively.

`listStatus` renamed to `listDirectory` for clarity.

In both Azure and Google, `fileStatus` does not use a list operation.
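To make the `FileStatus` side of this distinction concrete, here is a minimal sketch in Python. The `CountingStore` class and its `gets` / `lists` counters are hypothetical stand-ins for a cloud blob store, not Hail code; they only illustrate that a status lookup can be answered with a single metadata get and no list operation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileStatus:
    """Metadata about one blob; it says nothing about directories."""
    path: str
    size: int


@dataclass
class CountingStore:
    """Hypothetical in-memory blob store that counts operations by kind."""
    blobs: Dict[str, bytes]
    gets: int = 0
    lists: int = 0

    def file_status(self, path: str) -> FileStatus:
        # One metadata "get"; no list operation is issued.
        self.gets += 1
        if path not in self.blobs:
            raise FileNotFoundError(path)
        return FileStatus(path, len(self.blobs[path]))


store = CountingStore({"gs://bucket/a.txt": b"hello"})
status = store.file_status("gs://bucket/a.txt")
print(status.size, store.gets, store.lists)
```

In this sketch the lookup costs one get and zero lists, which is the property the commit message claims for `fileStatus` in both Azure and Google.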

`fileListEntry` can be used when we must know if a directory exists. I rewrote this from
first principles because:
1. In neither Google nor Azure did it check whether the path was both a directory and a file.
2. In Google, if the directory entry wasn't in the first page, it would fail. (NB: there are fifteen
   non-control characters in ASCII before `/`, so with a page size of 15 or fewer we'd miss the first
   entry with a `/` at the end.)
3. In Azure, we issued both a get and a list.

There are now unit tests for this method.
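The first-page problem can be reproduced with a small simulation. The sketch below is not the PR's Scala implementation: `list_pages` and `file_list_entry` are hypothetical helpers over an in-memory list of blob names, assuming the store returns names in lexicographic order. Fifteen sibling blobs whose names continue with characters that sort before `/` push the directory entry onto the second page when the page size is 15.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class FileListEntry:
    path: str
    is_file: bool
    is_directory: bool


def list_pages(names: List[str], prefix: str, page_size: int) -> Iterator[List[str]]:
    """Yield the sorted names under `prefix`, one page at a time."""
    matching = sorted(n for n in names if n.startswith(prefix))
    for i in range(0, len(matching), page_size):
        yield matching[i:i + page_size]


def file_list_entry(names: List[str], path: str, page_size: int) -> Optional[FileListEntry]:
    """Scan every page until we know whether `path` is a file, a directory, or both."""
    path = path.rstrip("/")
    is_file = False
    is_directory = False
    for page in list_pages(names, path, page_size):
        for name in page:
            if name == path:
                is_file = True
            elif name.startswith(path + "/"):
                is_directory = True
        if is_directory:
            # The exact-name blob sorts before everything else with this
            # prefix, so is_file is already settled by now.
            break
    if not (is_file or is_directory):
        return None
    return FileListEntry(path, is_file, is_directory)


# The 15 printable ASCII characters before "/" (0x20 through 0x2E).
siblings = ["dir" + chr(c) for c in range(0x20, 0x2F)]
names = siblings + ["dir/part-0"]

first_page = next(list_pages(names, "dir", 15))
entry = file_list_entry(names, "dir", 15)
print(any(n.startswith("dir/") for n in first_page))  # False: first page has only siblings
print(entry)
```

Reading only the first page would conclude that `dir` is not a directory, while scanning across pages finds `dir/part-0`. The same function also reports a path that is both a file and a directory, which is the first issue in the list above.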

---

1. `copyMerge` and `concatenateFiles` previously used `O(N_FILES)` list operations; they now use
   `O(N_FILES)` get operations.
2. Writers that used `exists` to check for a _SUCCESS file now use a get operation.
3. Index readers, import BGEN, and import plink all now check file size with a get operation.
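The pattern behind these three changes can be sketched as follows. This is an illustrative Python model, not the Scala code in the PR: `CountingStore`, `has_success_marker`, and `total_size` are hypothetical names, and the counters stand in for billed get and list operations.

```python
from typing import Dict, List


class CountingStore:
    """Hypothetical in-memory blob store that counts gets and lists."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self.blobs = blobs
        self.gets = 0
        self.lists = 0

    def get_size(self, path: str) -> int:
        # One metadata get per call.
        self.gets += 1
        if path not in self.blobs:
            raise FileNotFoundError(path)
        return len(self.blobs[path])

    def list_prefix(self, prefix: str) -> List[str]:
        # One list per call; the helpers below never need it.
        self.lists += 1
        return sorted(p for p in self.blobs if p.startswith(prefix))


def has_success_marker(store: CountingStore, directory: str) -> bool:
    """Check for a _SUCCESS file with a single get instead of a list."""
    try:
        store.get_size(directory.rstrip("/") + "/_SUCCESS")
        return True
    except FileNotFoundError:
        return False


def total_size(store: CountingStore, parts: List[str]) -> int:
    """Sum the sizes of known part files with one get per file."""
    return sum(store.get_size(p) for p in parts)


store = CountingStore({
    "out/_SUCCESS": b"",
    "out/part-0": b"abc",
    "out/part-1": b"defg",
})
ok = has_success_marker(store, "out")
size = total_size(store, ["out/part-0", "out/part-1"])
print(ok, size, store.gets, store.lists)
```

The number of operations is still linear in the number of files, as the first item says, but every one of them is a get rather than a list.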

That said, overall, the bulk of our Class A Operations are probably writes.
@danking danking force-pushed the reduce-class-a-followup branch from a149118 to eeedc94 Compare October 23, 2023 15:50
Dan King added 4 commits October 23, 2023 12:29
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching.

This PR extends hail-is#13885 into the public API.
@danking danking force-pushed the reduce-class-a-followup branch from eeedc94 to 28d9e12 Compare October 23, 2023 16:58
@danking danking closed this Oct 23, 2023