Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENG-6284] render tsv/csv #834

Merged
merged 28 commits into from
Dec 23, 2024

Conversation

aaxelb
Copy link
Contributor

@aaxelb aaxelb commented Dec 2, 2024

allow rendering search responses as lines of tab-separated or comma-separated values

main point:

  • add simple_tsv and simple_csv renderers in trove.render
    • can be seen with query param acceptMediatype=text/tab-separated-values or acceptMediatype=text/csv
    • get default columns from static DEFAULT_TABULAR_SEARCH_COLUMN_PATHS in trove.vocab.osfmap
  • allow "download" responses -- add withFileName=foo query param to get a response with Content-Disposition: attachment and a filename based on "foo"
  • allow absurd page sizes

changes made along the way:

  • introduce ProtoRendering as renderer output type, to better decouple rendering from view logic
    • include StreamableRendering for responses that might could be streamed, like csv/tsv (tho it's not currently handled any differently from SimpleRendering)
    • reshape BaseRenderer (and each existing renderer) to have a consistent call signature (and return ProtoRendering)
      • replace trove.render.get_renderer with trove.render.get_renderer_type -- instantiate the renderer with response data
  • add trove.views._responder with common logic for building a django HttpResponse for a ProtoRendering
    • consistently handles withFileName/Content-Disposition
  • move some osf-specific constants to trove.vocab.osfmap for easier reuse
  • pull out some abstractable logic:
    • from existing trove.render.simple_json into trove.render._simple_trovesearch (for renderers that include only the list of search results)
    • from existing tests.trove.derive._base into tests.trove._input_output_tests (for tests following the same simple input/output pattern as deriver and renderer tests)
  • add tests.trove.render to cover the new renderers simple_tsv and simple_csv, as well as the existing renderers jsonapi, simple_json, jsonld, and turtle
    • minimally update existing renderers to create consistent output

@aaxelb aaxelb changed the title [wip] render tsv/csv [ENG-6284][wip] render tsv/csv Dec 2, 2024
@aaxelb aaxelb force-pushed the feat/render-tsv-csv branch 2 times, most recently from 2503bce to 47d2150 Compare December 6, 2024 18:18
@aaxelb aaxelb force-pushed the feat/render-tsv-csv branch from 47d2150 to a81abf7 Compare December 6, 2024 18:20
@coveralls
Copy link

coveralls commented Dec 6, 2024

Coverage Status

coverage: 91.833% (+0.6%) from 91.22%
when pulling fb70351 on aaxelb:feat/render-tsv-csv
into 24bc70a on CenterForOpenScience:develop.

@aaxelb aaxelb marked this pull request as ready for review December 6, 2024 18:31
@aaxelb aaxelb changed the title [ENG-6284][wip] render tsv/csv [ENG-6284] render tsv/csv Dec 6, 2024
@aaxelb aaxelb force-pushed the feat/render-tsv-csv branch from b75d4e0 to ce20fc4 Compare December 6, 2024 18:50
Copy link
Member

@mfraezz mfraezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple nits or questions, but nothing blocking. Tests look sufficient, but behavior should still be confirmed manually on staging.

Pass complete :octocat:

trove/render/html_browse.py Outdated Show resolved Hide resolved
trove/render/html_browse.py Show resolved Hide resolved
trove/render/simple_tsv.py Show resolved Hide resolved
@@ -14,10 +14,13 @@
MANY_MORE = -1
MAX_OFFSET = 9997

DEFAULT_PAGE_SIZE = 13
MAX_PAGE_SIZE = 10000
Copy link
Member

@mfraezz mfraezz Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Is this maximum reasonable? Looks like it was previously 101.

Edit: I see the commit message called it "absurd," but I'm guessing it's also "justified for the sake of rendering files"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the need here is downloading all results in one response, but i hesitated to make that behavior automagic by mediatype... considered making withFileName obviate pagination whenever present, but overall i opted for consistent query param behavior, putting the onus on the client to string together all the params needed for the desired result (e.g. acceptMediatype=text/csv&page[size]=10000&withFileName=my-file-name for a full csv download with up to 10000 rows)

if 10000 at once turns out to be unreasonable in practice... a more complicated (but less costly all-at-once) alternative might be view logic that queries/renders smaller pages one at a time and streams the results

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: now streams, loading only one page (~100 rows) at a time, but streaming more than ~4000 items total still times out -- can further optimize or we can talk about increasing those timeouts for responses that are actively sending data...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not run into those same timeouts for ~4k items with production resourcing (or configuration -- unsure where you got that figure, but by default most nginx timeouts are between successive operations rather than the whole response), but I suspect it's fine for now and we can reevaluate if encountering that issue later.

Copy link
Contributor Author

@aaxelb aaxelb Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am seeing timeouts at just after 30sec on production
Screenshot 2025-01-02 at 10 38 01

looking at the .csv.part files, each one managed to stream a few pages before it dies... unsure what's stopping it (maybe making it less slow would help...)

@aaxelb aaxelb force-pushed the feat/render-tsv-csv branch from 88e566a to f3def1e Compare December 20, 2024 21:17
Copy link
Member

@mfraezz mfraezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :octocat:

@mfraezz mfraezz merged commit 75ab046 into CenterForOpenScience:develop Dec 23, 2024
3 checks passed
@aaxelb aaxelb deleted the feat/render-tsv-csv branch January 2, 2025 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants