Code refactoring #3

Merged (5 commits, Dec 3, 2024). Changes from all commits:
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
.dockerignore (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 **
 !requirements.txt
-!es_test_data.py
+!search_test.py
.drone.yml (2 changes: 1 addition & 1 deletion)

@@ -26,7 +26,7 @@ steps:
       - pip3 install -r requirements.txt
       - sleep 30
       - curl -s http://es:9200
-      - "python3 es_test_data.py -es_url=http://es:9200"
+      - "python3 search_test.py -es_url=http://es:9200"
     when:
       event:
         - pull_request
.github/workflows/tests.yml (2 changes: 1 addition & 1 deletion)

@@ -38,5 +38,5 @@ jobs:
           curl -s http://localhost:9200

       - name: Run tests
-        run: python3 es_test_data.py --count=1000 --es_url=http://localhost:9200
+        run: python3 search_test.py --count=2000 --search_db_url=http://localhost:9200
Dockerfile (5 changes: 3 additions & 2 deletions)

@@ -39,7 +39,8 @@ RUN pip install --ignore-installed --no-warn-script-location --prefix="/dist" -r

 WORKDIR /dist/

-COPY es_test_data.py .
+COPY modules ./modules/
+COPY search_test.py .

 # For debugging the Build Stage
 CMD ["bash"]

@@ -79,4 +80,4 @@ USER "$APP_USER_NAME"
 COPY --from=build --chown="$APP_USER_NAME":"$APP_GROUP_ID" /dist/ "$PYTHONUSERBASE"/

 # Use ENTRYPOINT instead of CMD to force the container to start the application
-ENTRYPOINT ["python", "es_test_data.py"]
+ENTRYPOINT ["python", "search_test.py"]
LICENSE (2 changes: 1 addition & 1 deletion)

@@ -1,6 +1,7 @@
 The MIT License (MIT)

 Copyright (c) 2015 Oliver
+Copyright (c) 2024 Codethink

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -19,4 +20,3 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
-
README.md (166 changes: 116 additions & 50 deletions), shown as the updated file:

@@ -1,62 +1,57 @@

# Elasticsearch For Beginners: Generate and Upload Randomized Test Data

Because everybody loves test data.

## Ok, so what is this thing doing?

`search_test.py` lets you generate and upload randomized test data to
your Elasticsearch or OpenSearch cluster so you can start running queries, see what performance
is like, and verify your cluster is able to handle the load.

It allows for easy configuring of what the test documents look like, what
kind of data types they include and what the field names are called.

## Cool, how do I use this?

### Run Python script

Let's assume you have an Elasticsearch or OpenSearch cluster running.

Python, [Tornado](https://github.com/tornadoweb/tornado/), and [Faker](https://github.com/joke2k/faker) are used. Run
`pip install tornado` and `pip install Faker` to install them if you don't have them already.
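
Both installs can also be done in one command:

```bash
pip3 install tornado Faker
```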

It's as simple as this:

```text
$ python3 search_test.py

***Start Data Generate Test***
Trying to create index http://localhost:9200/test_data
Creating index test_data done {'acknowledged': True, 'shards_acknowledged': True, 'index': 'test_data'}
Generating 2000 docs, upload batch size is 1000
Upload: OK - upload took: 412ms, total docs uploaded: 1000
Upload: OK - upload took: 197ms, total docs uploaded: 2000
Done - total docs uploaded: 2000, took 1 seconds
***Start Query All Test***
Total hits: 2000, Total pages: 1
Retrieved page 1 of 1
Scroll context cleared successfully
Total Querying time taken: 85.00ms
***Start Delete Index***
Deleting index 'test_data' done
```

Without any command line options, it will generate and upload 1000 documents
of the format

```json
{
    "name": <<str>>,
    "age": <<int>>,
    "@timestamp": <<tstxt>>
}
```

to an Elasticsearch cluster at `http://localhost:9200`, into an index called
`test_data`.

@@ -70,7 +65,7 @@ Requires [Docker](https://docs.docker.com/get-docker/) for running the app and [
```
1. Clone this repository
```bash
$ git clone <change_this_to_repository_url>
$ cd elasticsearch-test-data
```
1. Run the ElasticSearch stack
@@ -92,38 +87,106 @@

## Not bad but what can I configure?

`python search_test.py --help` gives you the full set of command line options;
here is more detail on the most important ones:

- `action`: [generate_data, query_all, custom_query, delete_index, all] choose one
  - generate_data: upload the data generated through `format` to the search database.
  - query_all: request all values of the specified index within the range given by `start_time` and `finish_time`. The generated data needs an `@timestamp` field for this query to work; if you don't want to use that key, use `custom_query`.
  - custom_query: specify the request body through a JSON file; this option requires `json_path` (see the sketch after this list). For more information read the API documentation: [ElasticSearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#run-an-es-search), [OpenSearch](https://opensearch.org/docs/latest/api-reference/search/)
  - delete_index: all data at the specified index will be deleted. (Please use with caution.)
  - all: (default) run the whole test cycle (generate_data -> query_all -> delete_index)
- For authentication to the server, the following options are available:
  - `validate_cert`: SSL certificate validation for requests. Use false for self-signed certificates
  - Certificate based auth:
    - `client_cert`: filepath of CA certificates in PEM format
    - `client_key`: filepath of the client SSL key
  - Username based auth:
    - `username`: the username when basic auth is required
    - `password`: the password when basic auth is required
- `--search_db_url=http://localhost:9200` the base URL of your search DB node, don't include the index name
- `--count=###` number of documents to generate and upload
- `--index_name=test_data` the name of the index to upload the data to.
If it doesn't exist it'll be created with these options
- `--num_of_shards=2` the number of shards for the index
- `--num_of_replicas=0` the number of replicas for the index
- `--batch_size=###` we use bulk upload to send the docs to the DB, this option controls how many we send at a time
- `--force_init_index=False` if `True` it will delete and re-create the index
- `--dict_file=filename.dic` if provided the `dict` data type will use words
from the dictionary file, format is one word per line. The entire file is
loaded at start-up so be careful with (very) large files.
- `--data_file=filename.json|filename.csv` if provided, all data in the file will be inserted into the DB. The file content has to be an array of JSON objects (the documents). If the file ends in `.csv` then the data is automatically converted into JSON and inserted as documents.
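
As referenced in the `custom_query` item above, here is a minimal sketch of driving it with a query file. The file name and the `match` query are only illustrative; any valid Elasticsearch/OpenSearch search body should work:

```bash
# Write an example search body to my_query.json (hypothetical file name),
# then run it against the cluster via the custom_query action
cat > my_query.json <<'EOF'
{
  "query": {
    "match": { "name": "John" }
  }
}
EOF
python3 search_test.py --action=custom_query --json_path=my_query.json \
  --search_db_url=http://localhost:9200
```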

### All configuration

| Setting | Description | Default Value |
| ----------------------- | ---------------------------------------------------------------------- | ----------------------- |
| action | Specify the action to be performed. | all |
| json_path | Query JSON file path | None |
| batch_size | bulk index batch size | 1000 |
| client_cert | Filepath of CA certificates in PEM format | None |
| client_key | Filepath of client SSL key | None |
| count | Number of docs to generate | 1000 |
| data_file | Name of the documents file to use | None |
| dict_file | Name of dictionary file to use | None |
| finish_time | Shape Finish Time in '%Y-%m-%d %H:%M:%S' format | None |
| force_init_index | Force deleting and re-initializing the index | False |
| format | Message format | (truncated for brevity) |
| http_upload_timeout | Timeout in seconds when uploading data | 10 |
| id_type | Type of 'id' to use for the docs, int or uuid4 | None |
| index_name | Name of the index to store your messages | test_data |
| index_type | Index type | test_type |
| number_of_replicas | Number of replicas | 1 |
| number_of_shards | Number of shards | 1 |
| search_db_url | URL of your DB | http://localhost:9200 |
| out_file | Write test data to out_file as well | False |
| password | Password for DB | None |
| random_seed | Random seed number for Faker | None |
| set_refresh | Set refresh rate to -1 before starting the upload | False |
| start_time | Shape Start Time in '%Y-%m-%d %H:%M:%S' format | None |
| username | Username for DB | None |
| validate_cert | SSL validate_cert for requests. Use false for self-signed certificates | True |
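
For example, to make the generated documents reproducible between runs (a sketch; per the table above, `random_seed` seeds Faker):

```bash
python3 search_test.py --action=generate_data --count=1000 --random_seed=42
```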

### How to set up a config file

All options can be set on the command line, but when there are many values to configure it is much more convenient to create a `server.conf` file and enter the desired options there.

Example:

Create the config file

```shell
cd ${REPOSITORY}/elasticsearch-test-data
touch server.conf
${EDITOR} server.conf
```

Edit the config file

```conf
# server.conf
action = "all"
search_db_url = "https://uri.for.search.db:port"
username = TEST_NAME
password = TEST_PASSWORD
```
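
The same settings can be passed as plain command line flags instead (assuming each config key maps to a `--key` option, as Tornado-style option parsing normally does):

```bash
python3 search_test.py --action=all \
  --search_db_url="https://uri.for.search.db:port" \
  --username=TEST_NAME --password=TEST_PASSWORD
```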

### What about the document format?

Glad you're asking, let's get to the doc format.

The doc format is configured via `--format=<<FORMAT>>` with the default being
`name:str,age:int,@timestamp:tstxt`.

The general syntax looks like this:

`<<field_name>>:<<field_type>>,<<field_name>>:<<field_type>>, ...`

For every document, `search_test.py` will generate random values for each of
the fields configured.

Currently supported field types are:
@@ -146,16 +209,19 @@ Currently supported field types are:
given list of `-` separated words, the words are optional defaulting to
`text1` `text2` and `text3`, min and max are optional, defaulting to `1`
and `1`
- `arr:[array_length_expression]:[single_element_format]` an array of entries
  with format specified by `single_element_format`. `array_length_expression`
  can be either a single number, or a pair of numbers separated by `-` (i.e. 3-7),
  defining the range of lengths from which a random length will be picked for each array
  (Example `int_array:arr:1-5:int:1:250`)

- `log_version` a random version `str` that looks like v1.1.1
- `sha` generate a random sha (length 40)
- `file_name` generate a fake file name, e.g. `file_name:.py`
- `uuid` generate a fake uuid
- `systemd` generate a fake systemd name
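
Putting a few of these together (a sketch; the field names are arbitrary and the `int_array` example is taken verbatim from the list above):

```bash
python3 search_test.py --count=1000 \
  --format=name:str,age:int:18:75,@timestamp:tstxt,commit:sha,int_array:arr:1-5:int:1:250
```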

## Todo

- document the remaining cmd line options
- more different format types
- ...
