
90-feat/backend: simplify database tables with JSON columns #95

Merged

Changes from all commits (24 commits)
56cfe90
Update interactive project creation
pekasen Nov 7, 2024
a36e875
Typos in from_dict and field_types
pekasen Nov 8, 2024
8c62044
New database model
pekasen Nov 8, 2024
ca34756
Add documentation to Router.
pekasen Nov 8, 2024
d5c2e6f
Add a default DataFrame, if the sampler's state is empty.
pekasen Nov 8, 2024
86fd2d6
New configuration model
pekasen Nov 8, 2024
89df4f1
Adjustments in Router and connected tests.
pekasen Nov 8, 2024
04bcd91
Adjustments in Configuration and connected tests.
pekasen Nov 8, 2024
4ed7bf9
Fix accidentally deleted fixture.
pekasen Nov 8, 2024
b36dd04
Fix typo.
pekasen Nov 8, 2024
5112c83
Add Pydantic for configuration validation.
pekasen Nov 8, 2024
ef1125e
Fix Python 3.8/3.9 compatibility.
pekasen Nov 8, 2024
b91893e
Add PlugIn to top-level module exports.
pekasen Nov 20, 2024
5e466a4
Add sampler specs to configuration types.
pekasen Nov 21, 2024
9329dfd
fix(Router): Better error description while validating.
pekasen Nov 21, 2024
d18629b
Make'm tests pass.
pekasen Nov 21, 2024
dd40727
Log node table in bedug mode.
pekasen Nov 21, 2024
98ae989
fix(spikyball): NANs should not propagate
pekasen Dec 4, 2024
1cc08b3
fix: set all possible NAs to 1
pekasen Dec 5, 2024
a8931d9
fix: while setting all NaNs to 1, correctly test for NaNs.
pekasen Dec 5, 2024
711fc0d
fix: include NAs and not only NaNs.
pekasen Dec 5, 2024
852eea8
fix: its a mistery why the normalization constant becomes NaN.
pekasen Dec 5, 2024
91fcb7d
update documentation
pekasen Dec 5, 2024
0cac941
Update .gitignore
pekasen Dec 5, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -154,3 +154,5 @@ notebooks/

# spiderexpress files
*.pe.yml

.DS_Store
72 changes: 46 additions & 26 deletions README.md
@@ -50,8 +50,6 @@ In the future we will provide a PyPI package which will make the installation pr

## Usage



```bash
$ spiderexpress --help
Usage: spiderexpress [OPTIONS] COMMAND [ARGS]...
@@ -64,6 +62,7 @@ Options:
Commands:
create create a new configuration
start start a job
list list all available connectors and strategies
```

### create
@@ -80,6 +79,12 @@ Options:
--interactive / --non-interactive
```

This leads you through a dialogue to create a configuration file for your project.
You need to supply at least the following information:
- the name of the project,
- a path for the seed file.
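Internally, the dialogue pairs each configuration key with a prompt and collects your answers into a mapping that is then written out as the `.pe.yml` file. A hypothetical sketch (the prompt list mirrors the one in `spiderexpress/cli.py`; the answer values are made up for illustration):

```python
# Prompts as they appear in the interactive `create` dialogue.
PROMPTS = [
    ("project_name", "Name of your project?"),
    ("db_url", "URL of your database?"),
    ("max_iteration", "How many iterations should be done?"),
    ("empty_seeds", "What should happen if seeds are empty? Can be 'stop' or 'retry'"),
    ("seed_file", "do you wish to read a file for seeds?"),
]

def build_config(answers: dict) -> dict:
    """Collect one answer per prompt key, as the interactive dialogue does."""
    return {key: answers[key] for key, _ in PROMPTS}

config = build_config({
    "project_name": "my_project",
    "db_url": "sqlite:///my_project.db",
    "max_iteration": 10000,
    "empty_seeds": "stop",
    "seed_file": "seeds.json",
})
```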

```bash
### start

This command starts a `spiderexpress` job with the given configuration file.
@@ -100,7 +105,7 @@ A `spiderexpress` project could for example look like this:
├── my_project
│ ├── my_project.pe.yml
│ ├── my_project.db
-│   └── seed_file.txt
+│   └── seed_file.json
```

Here, `my_project.db` is the resulting database, `my_project.pe.yml` is the project's configuration file, in which the data source, sampling strategy, and other parameters are specified (see [Configuration](#configuration) for further details), and `seed_file.json` contains the seed node names.
@@ -114,35 +119,50 @@ For example projects, please refer to the `examples` directory or the unit tests
The resulting file could look like something like this example:

```yaml
project_name: spider
batch_size: 150
db_url: test2.sqlite
db_url: sqlite:/// # in memory database for testing
db_schema:
empty_seeds: stop
max_iteration: 10000
edge_table_name: edge_list
node_table_name: node_list
seeds:
- ...
connector: telegram
strategy:
spikyball:
layer_max_size: 150
layers:
test:
eager: false
connector:
csv:
node_list_location: tests/stubs/7th_graders/nodes.csv
edge_list_location: tests/stubs/7th_graders/edges.csv
mode: out
routers:
- all: # This is the name of the router and should be the type of edge.
source: source # This is the field that is mapped to the source columns.
target:
- field: target # This is the field that is mapped to the target columns.
dispatch_with: test # This is the name of the layer to dispatch to.
sampler:
source_node_probability:
coefficient: 1
weights:
subscriber_count: 4
videos_count: 1
target_node_probability:
coefficient: 1
weights:
edge_probability:
coefficient: 1
weights:
views: 1
random:
n: 5
project_name: spider
seeds:
test:
- "1"
- "13"
```
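The commit history above adds Pydantic for configuration validation. The actual `Configuration` model lives in `spiderexpress.types`; as a dependency-free sketch of the kind of checks such a model enforces (which keys are required, and the `'stop'`/`'retry'` values for `empty_seeds`, are taken from the example and prompts above — the exact rules are assumptions):

```python
# Sketch only: the real validation is a Pydantic model in spiderexpress.types.
ALLOWED_EMPTY_SEEDS = {"stop", "retry"}

def validate_config(config: dict) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for key in ("project_name", "db_url", "layers", "seeds"):
        if key not in config:
            errors.append(f"missing required key: {key}")
    if config.get("empty_seeds") not in ALLOWED_EMPTY_SEEDS:
        errors.append("empty_seeds must be 'stop' or 'retry'")
    return errors

errors = validate_config({
    "project_name": "spider",
    "db_url": "sqlite:///",
    "empty_seeds": "stop",
    "layers": {"test": {}},
    "seeds": {"test": ["1", "13"]},
})
```

Collecting all errors instead of raising on the first mirrors the commit "fix(Router): Better error description while validating."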

> [!NOTE]
> **Nomenclature**: `spiderexpress` uses `source` and `target` as the column names for edges; nodes must have a `name` column.
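A minimal illustration of this nomenclature (the values and the extra `views` attribute are invented for the example):

```python
# Edges carry `source` and `target`; nodes carry a `name`.
edges = [
    {"source": "1", "target": "13", "views": 42},
    {"source": "13", "target": "7"},
]
nodes = [{"name": "1"}, {"name": "13"}, {"name": "7"}]

# Every edge endpoint should resolve to a node name.
names = {node["name"] for node in nodes}
dangling = [
    e for e in edges if e["source"] not in names or e["target"] not in names
]
```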

## Table Schemas

`spiderexpress` is an entirely persistent application: all of its state is kept, including the data retrieved from the connectors, the dense network layer, the sparse network, and the application's own data.

Most of the user-facing data is kept in the following few tables:

- `raw_data`: the raw data, as returned by the connector.
- `layer_dense_edges` and `layer_dense_nodes`: the network information for the dense network.
- `layer_sparse_edges` and `layer_sparse_nodes`: the same as above, but only the sampled edges and nodes are included.


How the tables are structured is determined by the configuration file.
The following sections describe the minimal configuration for the tables as well as the configuration syntax.
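To make the idea concrete, here is an illustrative peek at such a table with the standard-library `sqlite3` module. The table name comes from the list above, but the column layout of a real `my_project.db` depends on the configuration file, and the `data` JSON column shown here is an assumption in the spirit of this PR:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for my_project.db
# Hypothetical schema: endpoints as columns, remaining attributes in JSON.
conn.execute(
    "CREATE TABLE layer_dense_edges (source TEXT, target TEXT, data JSON)"
)
conn.execute(
    "INSERT INTO layer_dense_edges VALUES (?, ?, ?)",
    ("1", "13", json.dumps({"views": 42})),
)
row = conn.execute(
    "SELECT source, target, data FROM layer_dense_edges"
).fetchone()
payload = json.loads(row[2])  # per-edge attributes kept in the JSON column
```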

1 change: 1 addition & 0 deletions pyproject.toml
Expand Up @@ -15,6 +15,7 @@ click = "*"
psycopg2-binary = "*"
PyYAML = "*"
transitions = "*"
pydantic = "^2.9.2"

[tool.poetry.dev-dependencies]
ipykernel = "*"
4 changes: 3 additions & 1 deletion spiderexpress/__init__.py
@@ -1,6 +1,8 @@
""".. include:: ../README.md"""

from .spider import Spider
-from .types import Configuration
+from .types import Configuration, PlugIn

+__all__ = ["Spider", "Configuration", "PlugIn"]

__version__ = "0.1.0a0"
34 changes: 20 additions & 14 deletions spiderexpress/cli.py
@@ -11,13 +11,14 @@
- Refine verbs/commands for the CLI
- find a mechanism for stopping/starting collections
"""

+import sys
from importlib.metadata import entry_points
from pathlib import Path
-from loguru import logger as log

import click
import yaml
-import sys
+from loguru import logger as log

from .spider import CONNECTOR_GROUP, STRATEGY_GROUP, Spider
from .types import Configuration
@@ -34,22 +35,21 @@ def cli(ctx):
@cli.command()
@click.argument("config", type=click.Path(path_type=Path, exists=True))
@click.option("-v", "--verbose", count=True)
-@click.option("-l", "--logfile", type=click.Path(dir_okay=False, writable=True, path_type=str))
+@click.option(
+    "-l", "--logfile", type=click.Path(dir_okay=False, writable=True, path_type=str)
+)
@click.pass_context
def start(ctx: click.Context, config: Path, verbose: int, logfile: str):
"""start a job"""
-    logging_level = max(50 - (10 * verbose), 0)  # Allows logging level to be between 0 and 50.
+    logging_level = max(
+        50 - (10 * verbose), 0
+    )  # Allows logging level to be between 0 and 50.
    logging_configuration = {
-        "handlers": [
-            {
-                "sink": logfile or sys.stdout,
-                "level": logging_level
-            }
-        ],
-        "extra": {}
+        "handlers": [{"sink": logfile or sys.stdout, "level": logging_level}],
+        "extra": {},
    }
log.configure(**logging_configuration)
    log.debug(f"Starting logging with verbosity {logging_level}.")
ctx.obj.start(config)


@@ -62,7 +62,13 @@ def create(config: str, interactive: bool):

if interactive:
for key, description in [
-        ("seeds", "add seeds?"),
+        ("project_name", "Name of your project?"),
+        ("db_url", "URL of your database?"),
+        ("max_iteration", "How many iterations should be done?"),
+        (
+            "empty_seeds",
+            "What should happen if seeds are empty? Can be 'stop' or 'retry'",
+        ),
("seed_file", "do you wish to read a file for seeds?"),
]:
args[key] = click.prompt(description)
@@ -76,7 +82,7 @@
@cli.command()
def list(): # pylint: disable=W0622
"""list all plugins"""
-    click.echo("--- connectors ---", color="blue")
+    click.echo("--- connectors ---")
for connector in entry_points(group=CONNECTOR_GROUP):
click.echo(connector.name)
click.echo("--- strategies ---")
14 changes: 9 additions & 5 deletions spiderexpress/connectors/csv.py
@@ -1,13 +1,15 @@
"""A CSV-reading, network-rippin' connector for your testing purposes."""

import dataclasses
from typing import Dict, List, Optional, Tuple, Union

import pandas as pd

-from spiderexpress.types import PlugIn, fromdict
+from spiderexpress.types import PlugIn, from_dict

_cache = {}


@dataclasses.dataclass
class CSVConnectorConfiguration:
"""Configuration items for the csv_connector."""
@@ -23,7 +25,7 @@ def csv_connector(
) -> Tuple[pd.DataFrame, pd.DataFrame]:
"""The CSV connector!"""
if isinstance(configuration, dict):
-        configuration = fromdict(CSVConnectorConfiguration, configuration)
+        configuration = from_dict(CSVConnectorConfiguration, configuration)

if configuration.cache:
if configuration.edge_list_location not in _cache:
@@ -61,9 +63,11 @@ def csv_connector(

return (
edge_return,
-        nodes.loc[nodes.name.isin(node_ids), :]
-        if nodes is not None
-        else pd.DataFrame(),
+        (
+            nodes.loc[nodes.name.isin(node_ids), :]
+            if nodes is not None
+            else pd.DataFrame()
+        ),
)


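The rename from `fromdict` to `from_dict` above concerns a small helper that turns a plain mapping into a typed dataclass configuration. The real helper lives in `spiderexpress.types`; a minimal equivalent (the `mode` and `cache` defaults are assumptions — the fields themselves appear in the connector code above) might look like:

```python
import dataclasses
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")

def from_dict(cls: Type[T], data: Dict[str, Any]) -> T:
    """Build a dataclass instance from a dict, ignoring unknown keys."""
    field_names = {f.name for f in dataclasses.fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in field_names})

@dataclasses.dataclass
class CSVConnectorConfiguration:
    """Configuration items for the csv_connector (defaults assumed)."""
    node_list_location: str
    edge_list_location: str
    mode: str = "out"
    cache: bool = True

config = from_dict(
    CSVConnectorConfiguration,
    {
        "node_list_location": "nodes.csv",
        "edge_list_location": "edges.csv",
        "extra_key": "ignored",  # silently dropped, not a dataclass field
    },
)
```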