[FEATURE] Add z-score for the normalization processor #376 #470

Open
wants to merge 12 commits into
base: feature/z-score-normalization

Conversation

sam-herman

Description

This change implements #376

  • Add z-score for hybrid query normalization processor
  • Add an integration test (IT) that tests normalization end to end

Issues Resolved

Resolving #376

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [x] New functionality has been documented.
    • [x] New functionality has javadoc added
  • [x] Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Samuel Herman <[email protected]>
@sam-herman
Author

Reopening the PR that was previously at #468, this time against the feature branch instead of main.

@sam-herman
Author

Hi @navneet1v @martin-gaievski @heemin32, this is the new PR, opened this time against the feature branch. Feel free to continue providing your feedback here, as I closed the original PR.
This PR should hopefully address all of your comments from the previous PR.

Member

@martin-gaievski martin-gaievski left a comment

A couple of general things:

  • Did you test the accuracy and performance of your solution? For our implementation we used the BEIR challenge framework with some custom scripts; ideally, results should look something like those in the blog post where the feature was announced.
  • Please fix all CI checks; you can simulate them by running gradle check locally.

new TopDocs(new TotalHits(0, TotalHits.Relation.EQUAL_TO), new ScoreDoc[0]),
new TopDocs(
new TotalHits(3, TotalHits.Relation.EQUAL_TO),
new ScoreDoc[] { new ScoreDoc(3, 0.98058068f), new ScoreDoc(4, 0.39223227f), new ScoreDoc(2, -1.37281295f) }
Member

Can you please add a simple formula or method as part of the code comments, so we can understand how that score is calculated from the provided individual scores? Having a reference to a method description is good, but it's not the same. Something like what you added for the integ test assertions would be good.

Author

done
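
For reference, the kind of formula being asked for can be sketched in a few lines. This is a minimal illustration and not the PR's Java code: it assumes the normalization is z = (score - mean) / std using the population standard deviation, and the raw scores 12, 9 and 0 are hypothetical values chosen only because they reproduce the three expected scores in the test snippet above; the test's real inputs are not shown in this thread.

```python
import math


def z_score_normalize(scores):
    """Return (score - mean) / std for each score, using the population std."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    std = math.sqrt(variance)
    if std == 0:
        # All scores are identical, so there is no spread to normalize against.
        # Returning zeros here is an assumption, not the PR's documented behavior.
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


# Hypothetical raw scores for one sub-query, not taken from the PR.
normalized = z_score_normalize([12.0, 9.0, 0.0])
print([round(z, 8) for z in normalized])
```

With these inputs the mean is 7 and the population standard deviation is the square root of 26, so the lowest-scoring document ends up with a negative normalized score.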

@sam-herman
Author

sam-herman commented Nov 15, 2023

@martin-gaievski my results are as follows. Overall, z-score shows the best results, even among the hybrid queries.
This seems to support the results from the blog https://towardsdatascience.com/hybrid-search-2-0-the-pursuit-of-better-search-ce44d6f20c08
I can do more experiments, but do you think this would be sufficient for now for us to include it?

For BM25 as baseline

2023-11-14 15:52:31 - NDCG@5: 0.6347
2023-11-14 15:52:31 - NDCG@10: 0.6563
2023-11-14 15:52:31 - NDCG@100: 0.6810
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - MAP@5: 0.6018
2023-11-14 15:52:31 - MAP@10: 0.6119
2023-11-14 15:52:31 - MAP@100: 0.6179
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - Recall@5: 0.7154
2023-11-14 15:52:31 - Recall@10: 0.7790
2023-11-14 15:52:31 - Recall@100: 0.8842
2023-11-14 15:52:31 - 

2023-11-14 15:52:31 - P@5: 0.1560
2023-11-14 15:52:31 - P@10: 0.0853
2023-11-14 15:52:31 - P@100: 0.0100

For Neural search

2023-11-14 15:54:19 - NDCG@5: 0.5747
2023-11-14 15:54:19 - NDCG@10: 0.6073
2023-11-14 15:54:19 - NDCG@100: 0.6381
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - MAP@5: 0.5368
2023-11-14 15:54:19 - MAP@10: 0.5512
2023-11-14 15:54:19 - MAP@100: 0.5585
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - Recall@5: 0.6711
2023-11-14 15:54:19 - Recall@10: 0.7693
2023-11-14 15:54:19 - Recall@100: 0.9067
2023-11-14 15:54:19 - 

2023-11-14 15:54:19 - P@5: 0.1480
2023-11-14 15:54:19 - P@10: 0.0867
2023-11-14 15:54:19 - P@100: 0.0103
p50: 33.5
p90: 43.0
p99: 49.52999999999997

For min-max hybrid (weights 0.4, 0.3, 0.3):

2023-11-14 15:57:04 - NDCG@5: 0.6449
2023-11-14 15:57:04 - NDCG@10: 0.6757
2023-11-14 15:57:04 - NDCG@100: 0.7042
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - MAP@5: 0.6113
2023-11-14 15:57:04 - MAP@10: 0.6249
2023-11-14 15:57:04 - MAP@100: 0.6311
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - Recall@5: 0.7257
2023-11-14 15:57:04 - Recall@10: 0.8194
2023-11-14 15:57:04 - Recall@100: 0.9477
2023-11-14 15:57:04 - 

2023-11-14 15:57:04 - P@5: 0.1600
2023-11-14 15:57:04 - P@10: 0.0917
2023-11-14 15:57:04 - P@100: 0.0107
p50: 51.0
p90: 69.10000000000002
p99: 101.1099999999999

For Zscore Hybrid (weights 0.4, 0.3, 0.3):

2023-11-14 15:59:06 - NDCG@5: 0.6518
2023-11-14 15:59:06 - NDCG@10: 0.6710
2023-11-14 15:59:06 - NDCG@100: 0.7052
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - MAP@5: 0.6105
2023-11-14 15:59:06 - MAP@10: 0.6204
2023-11-14 15:59:06 - MAP@100: 0.6291
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - Recall@5: 0.7561
2023-11-14 15:59:06 - Recall@10: 0.8100
2023-11-14 15:59:06 - Recall@100: 0.9543
2023-11-14 15:59:06 - 

2023-11-14 15:59:06 - P@5: 0.1653
2023-11-14 15:59:06 - P@10: 0.0903
2023-11-14 15:59:06 - P@100: 0.0108
p50: 47.5
p90: 60.60000000000002
p99: 69.00999999999999

Note: Edited to reformat the results into a table

| Method | NDCG@5 | NDCG@10 | NDCG@100 | MAP@5 | MAP@10 | MAP@100 | Recall@5 | Recall@10 | Recall@100 | P@5 | P@10 | P@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.6347 | 0.6563 | 0.6810 | 0.6018 | 0.6119 | 0.6179 | 0.7154 | 0.7790 | 0.8842 | 0.1560 | 0.0853 | 0.0100 |
| Neural | 0.5747 | 0.6073 | 0.6381 | 0.5368 | 0.5512 | 0.5585 | 0.6711 | 0.7693 | 0.9067 | 0.1480 | 0.0867 | 0.0103 |
| Hybrid (min-max norm) | 0.6449 | 0.6757 | 0.7042 | 0.6113 | 0.6249 | 0.6311 | 0.7257 | 0.8194 | 0.9477 | 0.1600 | 0.0917 | 0.0107 |
| Hybrid (z-score norm) | 0.6518 | 0.6710 | 0.7052 | 0.6105 | 0.6204 | 0.6291 | 0.7561 | 0.8100 | 0.9543 | 0.1653 | 0.0903 | 0.0108 |

@martin-gaievski
Member

@samuel-oci That looks reasonable. Can you please add more info:

  • exact queries you've used for BM25 and hybrid
  • dataset(s)
  • model: was it generic or fine-tuned?
  • maybe scripts that you've used to run the benchmark

@sam-herman
Author

sam-herman commented Nov 16, 2023

Sure @martin-gaievski,
reposting the results:

| Method | NDCG@5 | NDCG@10 | NDCG@100 | MAP@5 | MAP@10 | MAP@100 | Recall@5 | Recall@10 | Recall@100 | P@5 | P@10 | P@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.6347 | 0.6563 | 0.6810 | 0.6018 | 0.6119 | 0.6179 | 0.7154 | 0.7790 | 0.8842 | 0.1560 | 0.0853 | 0.0100 |
| Neural | 0.5747 | 0.6073 | 0.6381 | 0.5368 | 0.5512 | 0.5585 | 0.6711 | 0.7693 | 0.9067 | 0.1480 | 0.0867 | 0.0103 |
| Hybrid (min-max norm) | 0.6449 | 0.6757 | 0.7042 | 0.6113 | 0.6249 | 0.6311 | 0.7257 | 0.8194 | 0.9477 | 0.1600 | 0.0917 | 0.0107 |
| Hybrid (z-score norm) | 0.6518 | 0.6710 | 0.7052 | 0.6105 | 0.6204 | 0.6291 | 0.7561 | 0.8100 | 0.9543 | 0.1653 | 0.0903 | 0.0108 |

Dataset: Scifact
Queries: same as here. I made minor modifications just to get the project to run properly, but didn't change the queries (I will share that as well after some cleanup).
Model: generic pre-trained all-MiniLM-L12-v2

scripts:

PORT=50365
HOST=localhost
URL="$HOST:$PORT"

curl -XPUT -H "Content-Type: application/json" $URL/_ingest/pipeline/nlp-pipeline -d '
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "AXA30IsByAqY8FkWHdIF",
        "field_map": {
           "passage_text": "passage_embedding"
        }
      }
    }
  ]
}'

curl -XDELETE $URL/scifact

curl -XPUT -H "Content-Type: application/json" $URL/scifact -d '
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline"
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name":"hnsw",
                    "engine":"lucene",
                    "space_type": "l2",
                    "parameters":{
                        "m":16,
                        "ef_construction": 512
                    }
                }
            },
            "passage_text": { 
                "type": "text"            
            },
            "passage_key": { 
                "type": "text"            
            },
            "passage_title": { 
                "type": "text"            
            }
        }
    }
}'

curl -XPUT -H "Content-Type: application/json" $URL/_search/pipeline/norm-minmax-pipeline-hybrid -d '
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.4,
              0.3,
              0.3
            ]
          }
        }
      }
    }
  ]
}'

curl -XPUT -H "Content-Type: application/json" $URL/_search/pipeline/norm-zscore-pipeline-hybrid -d '
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "z_score"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.4,
              0.3,
              0.3
            ]
          }
        }
      }
    }
  ]
}'

To use later with

PORT=50365
MODEL_ID="AXA30IsByAqY8FkWHdIF"
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=ingest
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=bm25
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=neural --pipelines=norm-minmax-pipeline --os_model_id=$MODEL_ID
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=hybrid --pipelines=norm-minmax-pipeline-hybrid --os_model_id=$MODEL_ID
pipenv run python test_opensearch.py --dataset=scifact --dataset_url="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip" --os_host=localhost --os_port=$PORT --os_index="scifact" --operation=evaluate --method=hybrid --pipelines=norm-zscore-pipeline-hybrid --os_model_id=$MODEL_ID
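
To make the role of the search pipelines above more concrete, here is a hedged sketch of how a hybrid search request that references one of them could be assembled. It only builds the request path and body and does not contact a cluster. The three sub-queries are an assumption made to line up with the three weights (0.4, 0.3, 0.3); the actual sub-queries live in the benchmark scripts, which are not shown in this thread, and `build_hybrid_search` is a hypothetical helper name.

```python
import json


def build_hybrid_search(query_text, model_id, pipeline, index="scifact", k=100):
    """Build the path and body for a hybrid search that uses a search pipeline."""
    path = f"/{index}/_search?search_pipeline={pipeline}"
    body = {
        "size": k,
        "query": {
            "hybrid": {
                # One sub-query per weight in the pipeline's combination step.
                # Which fields the benchmark really queried is an assumption.
                "queries": [
                    {"match": {"passage_text": {"query": query_text}}},
                    {"match": {"passage_title": {"query": query_text}}},
                    {
                        "neural": {
                            "passage_embedding": {
                                "query_text": query_text,
                                "model_id": model_id,
                                "k": k,
                            }
                        }
                    },
                ]
            }
        },
    }
    return path, body


path, body = build_hybrid_search(
    "example claim text", "AXA30IsByAqY8FkWHdIF", "norm-zscore-pipeline-hybrid"
)
print(path)
print(json.dumps(body, indent=2))
```

The normalization processor scores each sub-query's results separately and then combines them, so the order of the sub-queries has to match the order of the weights.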

@sam-herman
Author

sam-herman commented Nov 17, 2023

FYI: I noticed a few of the IT tests (which were not changed in this PR) are broken after merge with the upstream feature branch.

@martin-gaievski
Member

@samuel-oci thank you for sharing the details of the benchmark. It's not exactly what we used to run benchmarks on our side. Is it possible for you to adjust some things and run one more round? This way we can compare your numbers apples to apples with those we got before. Here is the list of what needs to be adjusted:

{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline",
        "number_of_shards": 4
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "innerproduct",
                    "parameters": {}
                }
            },
            "passage_text": {
                "type": "text"
            },
            "title_key": {
                "type": "text", "analyzer" : "english"
            },
            "text_key": {
                "type": "text", "analyzer" : "english"
            }
        }
    }
}
  • search pipeline with default weights, like below:
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}

@sam-herman
Author

Sure thing @martin-gaievski, I reproduced the results with the settings you suggested above and used trec-covid as the dataset. I also included L2 normalization this time, as an additional reference alongside min-max and z-score. I got the following results:

| Method | NDCG@5 | NDCG@10 | NDCG@100 | MAP@5 | MAP@10 | MAP@100 | Recall@5 | Recall@10 | Recall@100 | P@5 | P@10 | P@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.7336 | 0.6850 | 0.4734 | 0.0093 | 0.0163 | 0.0798 | 0.0100 | 0.0185 | 0.1120 | 0.7800 | 0.7260 | 0.4916 |
| Neural | 0.5179 | 0.4799 | 0.3509 | 0.0060 | 0.0104 | 0.0477 | 0.0072 | 0.0131 | 0.0781 | 0.5760 | 0.5200 | 0.3604 |
| Hybrid min-max | 0.7497 | 0.7263 | 0.4968 | 0.0099 | 0.0176 | 0.0851 | 0.0104 | 0.0197 | 0.1136 | 0.8000 | 0.7720 | 0.5046 |
| Hybrid l2 | 0.7398 | 0.7150 | 0.4919 | 0.0096 | 0.0171 | 0.0830 | 0.0101 | 0.0193 | 0.1134 | 0.7840 | 0.7640 | 0.4992 |
| Hybrid z-score | 0.6867 | 0.6467 | 0.4382 | 0.0086 | 0.0150 | 0.0710 | 0.0095 | 0.0173 | 0.1027 | 0.7360 | 0.6800 | 0.4456 |
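
For readers comparing the rows above, the two baseline normalizations can be sketched as follows. These are the textbook definitions of min-max and L2 normalization applied to one sub-query's scores, written as an illustration; the function names and the edge-case handling are assumptions and may differ from what the plugin does.

```python
import math


def min_max_normalize(scores):
    """Rescale scores linearly so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Single distinct value: mapping everything to 1.0 is an assumption.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def l2_normalize(scores):
    """Divide each score by the Euclidean norm of the whole score list."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0:
        return [0.0 for _ in scores]
    return [s / norm for s in scores]


# Hypothetical raw scores for one sub-query.
raw = [12.0, 9.0, 0.0]
print(min_max_normalize(raw))
print(l2_normalize(raw))
```

For non-negative input both techniques stay within 0 and 1, whereas a z-score is centered on the mean and is negative for every below-average document.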

Signed-off-by: Samuel Herman <[email protected]>

codecov bot commented Nov 20, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (46499fa) 84.37% compared to head (9a19fe7) 84.34%.

❗ Current head 9a19fe7 differs from pull request most recent head 4843b7b. Consider uploading reports for the commit 4843b7b to get more accurate results

Files Patch % Lines
...or/normalization/ZScoreNormalizationTechnique.java 84.05% 5 Missing and 6 partials ⚠️
Additional details and impacted files
@@                         Coverage Diff                         @@
##             feature/z-score-normalization     #470      +/-   ##
===================================================================
- Coverage                            84.37%   84.34%   -0.03%     
- Complexity                             498      523      +25     
===================================================================
  Files                                   40       41       +1     
  Lines                                 1491     1559      +68     
  Branches                               228      247      +19     
===================================================================
+ Hits                                  1258     1315      +57     
- Misses                                 133      138       +5     
- Partials                               100      106       +6     


@martin-gaievski
Member

@samuel-oci from the data you've provided for the trec-covid dataset, it seems that z-score is not performing that well compared to the other techniques. We're mainly looking at the NDCG metric to compare score accuracy.
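
Since NDCG is the deciding metric here, a small sketch of how NDCG@k is typically computed may help when reading the tables. This uses the linear-gain form with a log2 rank discount; evaluation tools differ in details such as the gain function, so treat it as an illustration rather than the exact computation behind the numbers in this thread.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (rank starts at 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k, all_relevances=None):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    all_relevances lists the judged relevance of every relevant document for
    the query; if omitted, the ideal ranking is built from the returned list.
    """
    pool = all_relevances if all_relevances is not None else ranked_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical ranking: relevant, not relevant, relevant.
print(ndcg_at_k([1, 0, 1], k=3))
```

A ranking that places all relevant documents first scores 1.0; pushing a relevant document down the list lowers the score, which is why small changes in how scores are normalized and combined can move NDCG.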

We need more information and data points to understand z-score performance better. Can you please run the same test for the other datasets mentioned in the blog https://opensearch.org/blog/hybrid-search/? The idea is to find out whether z-score performs better than min-max and l2 for any of the datasets. If it does, we need to find what is specific about those dataset(s) that makes z-score perform better. If we cannot find such a dataset, then we'll need to rethink whether we want to add this technique.

This is the list of datasets we used:

  • NFCorpus
  • Trec-Covid
  • Scidocs
  • Quora
  • Amazon ESCI
  • DBPedia
  • FiQA

I think DBPedia can be a problem due to its large size (it takes the longest to ingest), so you can skip it; the rest should be doable.

Another point: I reviewed the configuration I shared with you previously, and there is one adjustment you'll need to make.

This is the mapping we used in our benchmarking:

{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline",
        "number_of_shards": 12
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "innerproduct",
                    "parameters": {}
                }
            },
            "passage_text": {
                "type": "text"
            },
            "title_key": {
                "type": "text", "analyzer" : "english"
            },
            "text_key": {
                "type": "text", "analyzer" : "english"
            }
        }
    }
}

There are 12 shards in the configuration, but we also used 3 data nodes, so I gave the number of shards as 4. If you want to recreate our setup exactly, then you need 12 shards on 3 nodes. In our case we did that to measure latencies; it seems that you have good numbers there, so it's not a concern.
To summarize: please change the number of shards to 12. You can keep 1 data node or make it 3, up to you.

We're working on adding all these details to a separate issue to formalize the intake process for new techniques; that's a work in progress now: #444

@sam-herman
Author

sam-herman commented Nov 21, 2023

https://opensearch.org/blog/hybrid-search/

@martin-gaievski, while the PR is mostly trivial, I think the main issue here is actually the time and effort it takes to set up and reproduce results for what are overall quite small datasets and workloads that shouldn't require an external environment.
I have some ideas on how to address that efficiently.
I will continue this discussion on the benchmark framework for neural search in this thread:
#430

Regarding the benchmark itself, we would also need to change the combiner logic to something more z-score friendly. The current combination techniques have some limitations because they only support scores greater than 0.

Here is an example of the same benchmark on the scifact dataset. This time I also added a combiner that can take negative z-score values into account in the arithmetic mean (a proper z-score combiner should not be an arithmetic mean, with or without negatives, but we can use it as an approximation for now).
This one shows an advantage for the z-score normalization approach. I can get the rest of the datasets as well, but for now can we accept this dataset benchmark (it's part of the BEIR datasets) as sufficient justification?

| Method | NDCG@5 | NDCG@10 | NDCG@100 | MAP@5 | MAP@10 | MAP@100 | Recall@5 | Recall@10 | Recall@100 | P@5 | P@10 | P@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.6577 | 0.6809 | 0.7036 | 0.6211 | 0.6327 | 0.6382 | 0.7479 | 0.8131 | 0.9109 | 0.1620 | 0.0900 | 0.0103 |
| Neural | 0.5446 | 0.5615 | 0.5946 | 0.5134 | 0.5219 | 0.5292 | 0.6177 | 0.6654 | 0.8192 | 0.1367 | 0.0753 | 0.0093 |
| Hybrid min-max | 0.6248 | 0.6483 | 0.6790 | 0.5867 | 0.5979 | 0.6058 | 0.7178 | 0.7861 | 0.9177 | 0.1573 | 0.0880 | 0.0104 |
| Hybrid l2 | 0.6220 | 0.6379 | 0.6723 | 0.5854 | 0.5931 | 0.6003 | 0.7136 | 0.7598 | 0.9192 | 0.1553 | 0.0843 | 0.0104 |
| Hybrid z-score (default arithmetic mean combiner) | 0.6475 | 0.6705 | 0.6991 | 0.6100 | 0.6217 | 0.6282 | 0.7421 | 0.8044 | 0.9343 | 0.1607 | 0.0900 | 0.0106 |
| Hybrid z-score (arithmetic mean combiner with negatives) | 0.6595 | 0.6770 | 0.7045 | 0.6182 | 0.6275 | 0.6344 | 0.7644 | 0.8111 | 0.9327 | 0.1660 | 0.0907 | not reported |
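
The difference between the last two rows can be illustrated with a small sketch of a weighted arithmetic mean combiner. This is a simplified model and not the plugin's Java code: it assumes the default combiner drops non-positive scores (the limitation described above) and divides by the sum of the weights that were actually used, while the variant with negatives keeps every score. The function name, the `allow_negative` flag and the sample scores are all hypothetical.

```python
def weighted_arithmetic_mean(scores, weights, allow_negative=False):
    """Combine one document's per-sub-query scores into a single score.

    A score of None means the sub-query did not return the document.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for score, weight in zip(scores, weights):
        if score is None:
            continue
        if not allow_negative and score <= 0.0:
            # Assumed default behavior: only scores greater than 0 are counted.
            continue
        weighted_sum += score * weight
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0.0 else 0.0


# Hypothetical z-scores of one document across three sub-queries.
z_scores = [1.2, -0.8, 0.3]
weights = [0.4, 0.3, 0.3]
print(weighted_arithmetic_mean(z_scores, weights))
print(weighted_arithmetic_mean(z_scores, weights, allow_negative=True))
```

In this model the default combiner ignores the below-average sub-query entirely, so the document looks better than it should; keeping the negative value lets a poor match in one sub-query pull the combined score down.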

@martin-gaievski
Member

@samuel-oci I agree that testing for such a change is the main effort timewise, but it is an absolute necessity. The main point is that we need a strong case for why we're adding it. My view is that what we have now is a baseline, and anything we add after that should be compared against it and added if it works better for some or all cases. As soon as it becomes part of the codebase it can be used by any customer, and it's a maintainer's responsibility to respond to requests such as: "is/when is this technique better than technique X?"

The main data point we're looking for now is how much better or worse z-score results are than min-max and l2 on different datasets. For scifact it was shown before that z-score gives better NDCG, so that isn't new. I would love to see results for trec-covid; in the benchmark you shared before, z-score performed worse than both min-max and l2. Whether the adjusted combiner would be a tipping point is the question.

Did you check how the combiner for negative scores affects the scores from min-max and l2? If those scores remain the same, we can make such a combiner the default one, or you can make it configurable if z-score only shows good results with such a combiner.

@sam-herman
Author

Hi @martin-gaievski, that makes sense to me. If I understand your point, there are two things you are looking for:

  1. which use cases are going to be better supported by z-score normalization
  2. sufficient documentation to make it easy for the end user to understand

How many datasets should we benchmark this on? Or is it a more fluid limit, i.e. as long as it takes to find the answers to the previous two questions?

I haven't checked the new combiner code on l2 and min-max yet; I was hoping to contribute in parts and leave the combiner for later.
Currently I don't expect the geometric mean with negative values combiner to have an effect on the existing techniques (min-max and l2), but it broke some tests, which is why I decided not to do it for now.
If we do want to test with an appropriate z-score combiner, I will have to create a specific combiner for z-score that can combine two z-scores into a new z-score in the following way:
https://stats.stackexchange.com/questions/348192/combining-z-scores-by-weighted-average-sanity-check-please
I can try it out and add that to the benchmark as well.
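
One common way to combine z-scores into a new z-score is the weighted Stouffer method, sketched below. It divides the weighted sum by the square root of the sum of squared weights, which keeps unit variance only if the individual z-scores are independent. Whether this is exactly the approach intended by the linked discussion is an assumption, and `combine_z_scores` is a hypothetical name.

```python
import math


def combine_z_scores(z_scores, weights):
    """Weighted Stouffer combination: sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    if len(z_scores) != len(weights):
        raise ValueError("z_scores and weights must have the same length")
    denominator = math.sqrt(sum(w * w for w in weights))
    if denominator == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * z for w, z in zip(weights, z_scores)) / denominator


# Two hypothetical sub-query z-scores with equal weights.
print(combine_z_scores([1.0, 1.0], [0.5, 0.5]))
```

Unlike a plain weighted average, two agreeing z-scores of 1.0 combine to a value above 1.0, reflecting that independent evidence pointing the same way is stronger than either score alone.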

@martin-gaievski
Member

@samuel-oci are you planning to continue work on this feature? Checking, as there has been no activity for the last couple of weeks.

@sam-herman
Author

@martin-gaievski yes, just added the commits that include the scripts used for testing as well, hopefully those will benefit others too when using BEIR for neural-search testing.

It's been a bit of a challenge to get my hands on hardware that will allow me to run the tests on the larger datasets in BEIR, but I think I should be able to get those numbers soon.

martin-gaievski added the Enhancements label (Increases software capabilities beyond original client specifications) on Aug 7, 2024
@jmazanec15
Member

@samuel-oci is this still being worked on?

@minalsha
Collaborator

Hi @samuel-oci is this still being worked on?

@sam-herman
Author

No, I'm not working on this anymore. Feel free to close it.

@vibrantvarun
Member

Next action items: a deep dive on search relevance is needed, along with benchmarking on some more datasets.

6 participants