upgrade to latest quickstart_etl from examples
shalabhc committed May 18, 2023
1 parent 07b5288 commit d2497cc
Showing 15 changed files with 130 additions and 64 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/dagster-cloud-deploy.yml
@@ -113,27 +113,27 @@ jobs:
       # username: _json_key
       # password: ${{ secrets.GCR_JSON_KEY }}

-      # Build "example_location" location.
+      # Build "quickstart_etl" location.
       # For each code location, the "build-push-action" builds the docker
       # image and a "set-build-output" command records the image tag for each code location.
       # To re-use the same docker image across multiple code locations, build the docker image once
       # and specify the same tag in multiple "set-build-output" commands. To use a different docker
       # image for each code location, use multiple "build-push-actions" with a location specific
       # tag.
-      - name: Build and upload Docker image for "example_location"
+      - name: Build and upload Docker image for "quickstart_etl"
         if: steps.prerun.outputs.result != 'skip'
         uses: docker/build-push-action@v4
         with:
           context: .
           push: true
-          tags: ${{ env.IMAGE_REGISTRY }}:${{ env.IMAGE_TAG }}-example-location
+          tags: ${{ env.IMAGE_REGISTRY }}:${{ env.IMAGE_TAG }}-quickstart-etl

-      - name: Update build session with image tag for example_location
+      - name: Update build session with image tag for quickstart_etl
         id: ci-set-build-output-example-location
         if: steps.prerun.outputs.result != 'skip'
         uses: dagster-io/dagster-cloud-action/actions/utils/[email protected]
         with:
-          command: "ci set-build-output --location-name=data-eng-pipeline --image-tag=$IMAGE_TAG-example-location"
+          command: "ci set-build-output --location-name=data-eng-pipeline --image-tag=$IMAGE_TAG-quickstart-etl"

       # Deploy all code locations in this build session to Dagster Cloud
       - name: Deploy to Dagster Cloud
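The comment block in this hunk describes two options: build one image and record the same tag for several code locations, or build one image per location with a location-specific tag. A minimal Python sketch of how the `ci set-build-output` command strings would line up in each case is given here. The helper name `set_build_output_commands`, the sample tag `abc123`, and the second location `other_location` are hypothetical; only the shape of the command string is taken from the workflow.

```python
def set_build_output_commands(image_tag, suffix_by_location):
    """Build one `ci set-build-output` command string per code location.

    `suffix_by_location` maps a location name to the tag suffix of the image
    it should run. Locations that share a suffix share one Docker image.
    """
    return [
        f"ci set-build-output --location-name={location} --image-tag={image_tag}-{suffix}"
        for location, suffix in suffix_by_location.items()
    ]


# Option 1: one image, re-used by two locations (same suffix for both).
shared = set_build_output_commands(
    "abc123",
    {"quickstart_etl": "quickstart-etl", "other_location": "quickstart-etl"},
)

# Option 2: one image per location (location-specific suffixes).
separate = set_build_output_commands(
    "abc123",
    {"quickstart_etl": "quickstart-etl", "other_location": "other-location"},
)

for command in shared + separate:
    print(command)
```

In the first case a single `build-push-action` step would be enough; in the second case each suffix needs its own build step that pushes a matching tag.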
8 changes: 4 additions & 4 deletions README.md
@@ -24,7 +24,7 @@ Click the `Use this Template` button and provide details for your new repo.

 ## Step 2. Add your Docker registry to `dagster_cloud.yaml`

-The [`dagster_cloud.yaml`](./dagster_cloud.yaml) file defines the configuration for building and deploying your code locations. For the `example_location`, specify the Docker registry in the `registry:` key:
+The [`dagster_cloud.yaml`](./dagster_cloud.yaml) file defines the configuration for building and deploying your code locations. For the `quickstart_etl`, specify the Docker registry in the `registry:` key:

 https://github.com/dagster-io/dagster-cloud-hybrid-quickstart/blob/669cc3acac00a070b38ec50e0c158b0c3d8b6996/dagster_cloud.yaml#L7

@@ -62,7 +62,7 @@ Set up secrets on your newly created repository by navigating to the `Settings`

 ## Step 5. Verify builds are successful

-At this point, the workflow run should complete successfully and you should see the `example_location` in https://dagster.cloud. If builds are failing, ensure that your
+At this point, the workflow run should complete successfully and you should see the `quickstart_etl` in https://dagster.cloud. If builds are failing, ensure that your
 secrets are properly set up the workflow properly sets up Docker regsitry access.

 <img width="993" alt="Screen Shot 2022-08-08 at 9 07 25 PM" src="https://user-images.githubusercontent.com/10215173/183562119-90375ca1-c119-4154-8e30-8b85916628b8.png">
@@ -87,7 +87,7 @@ https://github.com/dagster-io/dagster-cloud-hybrid-quickstart/blob/9f63f62b1a7ca

 ## Customize the Docker build process

-A standard `Dockerfile` is included in this project and used to build the `example_location`. This file is used by the `build-push-action`:
+A standard `Dockerfile` is included in this project and used to build the `quickstart_etl`. This file is used by the `build-push-action`:

 https://github.com/dagster-io/dagster-cloud-hybrid-quickstart/blob/fa0a0d3409fda4c342da41c970f568d32996747f/.github/workflows/dagster-cloud-deploy.yml#L123-L129

@@ -105,5 +105,5 @@ The `ci-init` step accepts a `location_names` input string containing a JSON lis
 project_dir: ${{ env.DAGSTER_PROJECT_DIR }}
 dagster_cloud_yaml_path: ${{ env.DAGSTER_CLOUD_YAML_PATH }}
 deployment: 'prod'
-location_names: '["example_location1", "location2"]' # only deploy these two locations
+location_names: '["quickstart_etl1", "location2"]' # only deploy these two locations
 ```
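The `location_names` input in the README hunk is a JSON list encoded as a string. As a small illustration, assuming the value is generated by a script rather than typed by hand, the string can be produced with the standard `json` module. The helper name `location_names_input` is hypothetical and not part of any Dagster tooling.

```python
import json


def location_names_input(names):
    """Serialize code location names into the JSON list string used for `location_names`."""
    return json.dumps(list(names))


value = location_names_input(["quickstart_etl1", "location2"])
print(value)

# Round-trip check: the string parses back to the original list.
assert json.loads(value) == ["quickstart_etl1", "location2"]
```

Generating the string this way avoids quoting mistakes, since the workflow expects valid JSON inside a YAML string.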
8 changes: 2 additions & 6 deletions dagster_cloud.yaml
@@ -1,8 +1,4 @@
 locations:
-  - location_name: example_location
+  - location_name: quickstart_etl
     code_source:
-      package_name: my_dagster_project
-    build:
-      directory: ./
-      registry: <account-id>.dkr.ecr.us-west-2.amazonaws.com/branch-deployments-gh-action-test
-
+      package_name: quickstart_etl
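Because this commit renames the package from `my_dagster_project` to `quickstart_etl`, the `package_name` in `dagster_cloud.yaml` has to point at a package that actually exists in the repository. The following is a rough sketch of such a consistency check, not something Dagster Cloud provides: the function `missing_packages` is hypothetical, and the config is written as a plain dict equivalent to the new YAML so that the example needs only the standard library.

```python
import tempfile
from pathlib import Path


def missing_packages(config, project_dir):
    """Return package_name values that have no package directory under project_dir."""
    missing = []
    for location in config.get("locations", []):
        package = location.get("code_source", {}).get("package_name")
        if package and not (Path(project_dir) / package / "__init__.py").is_file():
            missing.append(package)
    return missing


# Hand-written dict equivalent of the new dagster_cloud.yaml.
config = {
    "locations": [
        {
            "location_name": "quickstart_etl",
            "code_source": {"package_name": "quickstart_etl"},
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_packages(config, root)  # package directory not created yet
    (root / "quickstart_etl").mkdir()
    (root / "quickstart_etl" / "__init__.py").touch()
    after = missing_packages(config, root)  # package now present

print(before, after)
```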
1 change: 0 additions & 1 deletion my_dagster_project/__init__.py

This file was deleted.

17 changes: 0 additions & 17 deletions my_dagster_project/assets/__init__.py

This file was deleted.

11 changes: 0 additions & 11 deletions my_dagster_project/repository.py

This file was deleted.

11 changes: 0 additions & 11 deletions my_dagster_project_tests/test_assets.py

This file was deleted.

3 changes: 3 additions & 0 deletions pyproject.toml
@@ -1,3 +1,6 @@
 [build-system]
 requires = ["setuptools"]
 build-backend = "setuptools.build_meta"
+
+[tool.dagster]
+module_name = "quickstart_etl"
16 changes: 16 additions & 0 deletions quickstart_etl/__init__.py
@@ -0,0 +1,16 @@
+from dagster import (
+    Definitions,
+    ScheduleDefinition,
+    define_asset_job,
+    load_assets_from_package_module,
+)
+
+from . import assets
+
+daily_refresh_schedule = ScheduleDefinition(
+    job=define_asset_job(name="all_assets_job"), cron_schedule="0 0 * * *"
+)
+
+defs = Definitions(
+    assets=load_assets_from_package_module(assets), schedules=[daily_refresh_schedule]
+)
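The schedule added in this file uses the cron expression `0 0 * * *`, which fires once a day at midnight. A minimal sketch of what that means for the next run time is shown here. It is not a general cron parser and does not use Dagster; the function `next_midnight` is a hypothetical helper, and the sample timestamp assumes the schedule is evaluated in UTC.

```python
from datetime import datetime, timedelta, timezone


def next_midnight(after):
    """Return the first 00:00 strictly after `after`, i.e. the next tick of "0 0 * * *"."""
    start_of_day = after.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_day + timedelta(days=1)


now = datetime(2023, 5, 18, 15, 30, tzinfo=timezone.utc)
tick = next_midnight(now)
print(tick.isoformat())
```

With this schedule, `all_assets_job` would be launched once per day, so every asset in the `assets` package is refreshed daily.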
Empty file.
83 changes: 83 additions & 0 deletions quickstart_etl/assets/hackernews.py
@@ -0,0 +1,83 @@
+import base64
+from io import BytesIO
+from typing import List
+
+import matplotlib.pyplot as plt
+import pandas as pd
+import requests
+from dagster import MetadataValue, OpExecutionContext, asset
+from wordcloud import STOPWORDS, WordCloud
+
+
+@asset(group_name="hackernews", compute_kind="HackerNews API")
+def hackernews_topstory_ids() -> List[int]:
+    """Get up to 500 top stories from the HackerNews topstories endpoint.
+
+    API Docs: https://github.com/HackerNews/API#new-top-and-best-stories
+    """
+    newstories_url = "https://hacker-news.firebaseio.com/v0/topstories.json"
+    top_500_newstories = requests.get(newstories_url).json()
+    return top_500_newstories
+
+
+@asset(group_name="hackernews", compute_kind="HackerNews API")
+def hackernews_topstories(
+    context: OpExecutionContext, hackernews_topstory_ids: List[int]
+) -> pd.DataFrame:
+    """Get items based on story ids from the HackerNews items endpoint. It may take 1-2 minutes to fetch all 500 items.
+
+    API Docs: https://github.com/HackerNews/API#items
+    """
+    results = []
+    for item_id in hackernews_topstory_ids:
+        item = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json").json()
+        results.append(item)
+        if len(results) % 20 == 0:
+            context.log.info(f"Got {len(results)} items so far.")
+
+    df = pd.DataFrame(results)
+
+    # Dagster supports attaching arbitrary metadata to asset materializations. This metadata will be
+    # shown in the run logs and also be displayed on the "Activity" tab of the "Asset Details" page in the UI.
+    # This metadata would be useful for monitoring and maintaining the asset as you iterate.
+    # Read more about in asset metadata in https://docs.dagster.io/concepts/assets/software-defined-assets#recording-materialization-metadata
+    context.add_output_metadata(
+        {
+            "num_records": len(df),
+            "preview": MetadataValue.md(df.head().to_markdown()),
+        }
+    )
+    return df
+
+
+@asset(group_name="hackernews", compute_kind="Plot")
+def hackernews_topstories_word_cloud(
+    context: OpExecutionContext, hackernews_topstories: pd.DataFrame
+) -> bytes:
+    """Exploratory analysis: Generate a word cloud from the current top 500 HackerNews top stories.
+
+    Embed the plot into a Markdown metadata for quick view.
+    Read more about how to create word clouds in http://amueller.github.io/word_cloud/.
+    """
+    stopwords = set(STOPWORDS)
+    stopwords.update(["Ask", "Show", "HN"])
+    titles_text = " ".join([str(item) for item in hackernews_topstories["title"]])
+    titles_cloud = WordCloud(stopwords=stopwords, background_color="white").generate(titles_text)
+
+    # Generate the word cloud image
+    plt.figure(figsize=(8, 8), facecolor=None)
+    plt.imshow(titles_cloud, interpolation="bilinear")
+    plt.axis("off")
+    plt.tight_layout(pad=0)
+
+    # Save the image to a buffer and embed the image into Markdown content for quick view
+    buffer = BytesIO()
+    plt.savefig(buffer, format="png")
+    image_data = base64.b64encode(buffer.getvalue())
+    md_content = f"![img](data:image/png;base64,{image_data.decode()})"
+
+    # Attach the Markdown content as metadata to the asset
+    # Read about more metadata types in https://docs.dagster.io/_apidocs/ops#metadata-types
+    context.add_output_metadata({"plot": MetadataValue.md(md_content)})
+
+    return image_data
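The word cloud asset embeds its plot in Markdown by base64-encoding the PNG bytes into a `data:` URI. The same step can be sketched in isolation with only the standard library. The helper name `png_to_markdown` is hypothetical, and the 8-byte PNG signature stands in for a real image so the example runs without matplotlib.

```python
import base64


def png_to_markdown(png_bytes, alt="img"):
    """Wrap raw PNG bytes in a Markdown image tag that uses a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode()
    return f"![{alt}](data:image/png;base64,{encoded})"


# The PNG file signature is used as placeholder content instead of a real plot.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
md_content = png_to_markdown(PNG_SIGNATURE)
print(md_content)
```

In the asset itself the resulting string is passed to `MetadataValue.md`, so the image is displayed wherever that metadata is rendered.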
File renamed without changes.
1 change: 1 addition & 0 deletions quickstart_etl_tests/test_assets.py
@@ -0,0 +1 @@
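The new test file is added empty. One way the fetch loop in `hackernews_topstories` could be tested without network access is to pull the loop into a helper that takes the fetch function as a parameter. This is only a sketch of that idea under stated assumptions: `fetch_items` and its parameters are hypothetical, are not part of this commit, and the stub data is invented for the test.

```python
def fetch_items(item_ids, fetch, log=print, log_every=20):
    """Fetch each item with `fetch` and log progress every `log_every` items."""
    results = []
    for item_id in item_ids:
        results.append(fetch(item_id))
        if len(results) % log_every == 0:
            log(f"Got {len(results)} items so far.")
    return results


def test_fetch_items_logs_every_20():
    messages = []
    # Stub fetcher: returns a fake item instead of calling the HackerNews API.
    items = fetch_items(
        range(45),
        fetch=lambda item_id: {"id": item_id, "title": f"story {item_id}"},
        log=messages.append,
    )
    assert len(items) == 45
    assert items[0] == {"id": 0, "title": "story 0"}
    assert messages == ["Got 20 items so far.", "Got 40 items so far."]


test_fetch_items_logs_every_20()
```

Injecting the fetcher keeps the test fast and deterministic; the asset would pass a function that wraps `requests.get` and `context.log.info` as the logger.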

2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,2 +1,2 @@
 [metadata]
-name = my_dagster_project
+name = quickstart_etl
23 changes: 15 additions & 8 deletions setup.py
@@ -1,10 +1,17 @@
 from setuptools import find_packages, setup

-if __name__ == "__main__":
-    setup(
-        name="my_dagster_project",
-        packages=find_packages(exclude=["my_dagster_project_tests"]),
-        install_requires=[
-            "dagster",
-        ],
-    )
+setup(
+    name="quickstart_etl",
+    packages=find_packages(exclude=["quickstart_etl_tests"]),
+    install_requires=[
+        "dagster",
+        "dagster-cloud",
+        "boto3",
+        "pandas",
+        "matplotlib",
+        "textblob",
+        "tweepy",
+        "wordcloud",
+    ],
+    extras_require={"dev": ["dagit", "pytest"]},
+)
