Make it easy to generate a fake timeseries dataset #1

Open
MrPowers opened this issue Dec 12, 2022 · 1 comment
Comments


Make it simple to generate timeseries data.


MrPowers commented May 7, 2023

Useful code snippet to leverage:

import itertools
from datetime import datetime, timedelta

import pyarrow as pa
import pyarrow.compute as pc
from deltalake import write_deltalake

def record_observations(date: datetime) -> pa.Table:
    """Generate a table of fake observations for the given datetime."""
    nrows = 1000
    return pa.table(
        {
            "date": pa.array([date.date()] * nrows),
            "timestamp": pa.array([date] * nrows),
            "value": pc.random(nrows),
        }
    )


# Example of output
record_observations(datetime(2021, 1, 1, 12)).to_pandas()

hours_iter = (datetime(2021, 1, 1) + timedelta(hours=i) for i in itertools.count())

# Write 100 hours' worth of data, one append per hour
for timestamp in itertools.islice(hours_iter, 100):
    write_deltalake(
        "observation_data",
        record_observations(timestamp),
        partition_by=["date"],
        mode="append",
    )
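
One way the snippet above might be generalized into a reusable helper, as a rough sketch only: the function name `fake_timeseries_batches` and its parameters (`step`, `rows_per_period`, `seed`) are hypothetical and not part of any existing API. It uses only the standard library and yields one column dict per period with the same three columns as `record_observations`, with a seeded generator so the fake data is reproducible.

```python
import random
from datetime import datetime, timedelta
from typing import Dict, Iterator, List


def fake_timeseries_batches(
    start: datetime,
    periods: int,
    step: timedelta = timedelta(hours=1),
    rows_per_period: int = 1000,
    seed: int = 0,
) -> Iterator[Dict[str, List]]:
    """Yield one dict of columns per period (hypothetical helper).

    Each dict has the same columns as record_observations above:
    "date", "timestamp", and a random "value" in [0, 1).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    for i in range(periods):
        ts = start + i * step
        yield {
            "date": [ts.date()] * rows_per_period,
            "timestamp": [ts] * rows_per_period,
            "value": [rng.random() for _ in range(rows_per_period)],
        }


# Three hourly batches of five rows each
batches = list(
    fake_timeseries_batches(datetime(2021, 1, 1), periods=3, rows_per_period=5)
)
```

Each yielded dict should be usable as `pa.table(batch)` and could then be passed to `write_deltalake(...)` in the same append loop as above.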
