Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ post Mar 22 ] revise database schema for storing lottery data #515

Open
yunlanli opened this issue Mar 17, 2022 · 1 comment
Open

[ post Mar 22 ] revise database schema for storing lottery data #515

yunlanli opened this issue Mar 17, 2022 · 1 comment
Labels
help wanted Extra attention is needed lottery predictor

Comments

@yunlanli
Copy link
Contributor

yunlanli commented Mar 17, 2022

Description

Previously, since we only have one year of lottery data, it made sense to co-locate the lottery number and the room information in a single table. Now that we have multiple years of data, this schema is no longer suitable. If we add one column for each year's lottery data, the table will soon become very large. Moreover, it would be hard to mark a single year's data as an outlier (maybe a 32-bit bitmap will work, but probably not a brilliant idea).

I'm thinking of using one table for storing room information with a composite primary key consisting of residence hall and room number, and other tables for storing lottery data and a foreign key that references the table containing room information.

I think this needs further discussion. For example, should we assign each dorm an ID as a primary key? Should we store lottery numbers each year in a single table or spread them into mulitple identified by year? It might be worth creating views to cater to our analytics needs.

@yunlanli yunlanli added help wanted Extra attention is needed lottery predictor labels Mar 17, 2022
@yunlanli
Copy link
Contributor Author

yunlanli commented Apr 3, 2022

Here are some challenges that we face with the current schema:

Lottery Data Aggregation

a) lottery data for each year is stored in a separate table with (residence hall, room, room suffix) as the primary key
b) lottery data from different year may contain different rooms: e.g (Schapiro, 1203, 1) may exist in 2021 but not in 2020.

Idea 1: A single SQL query

Our MySQL version (5.27) doesn't support full outer join, only left and right outer join are supported, so

  • it would be hard to write a single SQL query to obtain lottery data aggregate
  • it's already hard to account for b) with 2 years of data, the SQL query could become quickly unmaintainable with several years of data to account for b).
  • too many join operations

This approach seems inefficient and can quickly become unmaintainable.

Idea 2: A new Table / View

Will add details later. Thinking about

  • a table for dorm: assign a room id to each room, add the constraint that (residence hall, room, room suffix) must be unique
  • a table for lottery data, primary key: room id, a column for each year's lottery data
  • a table / view for aggregate lottery data

Every year's lottery data go through an ETL process and eventually end up in the lottery data table.
The idea is to make this scheme optimal for analytics query.

Lottery Data Filter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed lottery predictor
Projects
None yet
Development

No branches or pull requests

1 participant