example GLiNER integration (#1504)
omri374 authored Jan 13, 2025
1 parent c01f0bd commit 7d26bac
Showing 7 changed files with 440 additions and 46 deletions.
9 changes: 5 additions & 4 deletions docs/samples/index.md
@@ -25,14 +25,15 @@
| Usage | Text | Python file | [Passing a lambda as a Presidio anonymizer using Faker](python/example_custom_lambda_anonymizer.py)|
| Usage | Text | Python file | [Synthetic data generation with OpenAI](python/synth_data_with_openai.ipynb)|
| Usage | Text | Python file | [Keeping some entities from being anonymized](python/keep_entities.ipynb)|
| Usage | Text | LiteLLM Proxy | [PII Masking LLM calls across Anthropic/Gemini/Bedrock/Azure, etc.](docker/litellm.md)|
| Usage | Text | Python Notebook | [YAML based no-code configuration](python/no_code_config.ipynb) |
| Usage | Text | Python file | [Using GLiNER within Presidio](python/gliner.md) |
| Usage | | REST API (postman) | [Presidio as a REST endpoint](docker/index.md)|
| Deployment | | App Service | [Presidio with App Service](deployments/app-service/index.md)|
| Deployment | | Kubernetes | [Presidio with Kubernetes](deployments/k8s/index.md)|
| Deployment | | Spark/Azure Databricks | [Presidio with Spark](deployments/spark/index.md)|
| Deployment | | Azure Data Factory with App Service | [ETL for small dataset](deployments/data-factory/presidio-data-factory.md#option-1-presidio-as-an-http-rest-endpoint) |
| Deployment | | Azure Data Factory with Databricks | [ETL for large datasets](deployments/data-factory/presidio-data-factory.md#option-2-presidio-on-azure-databricks) |
| ADF Pipeline | | Azure Data Factory | [Add Presidio as an HTTP service to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-http.md) |
| ADF Pipeline | | Azure Data Factory | [Add Presidio on Databricks to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-databricks.md) |
| Demo | | Streamlit app | [Create a simple demo app using Streamlit](python/streamlit/index.md)
79 changes: 79 additions & 0 deletions docs/samples/python/gliner.md
@@ -0,0 +1,79 @@
# Using GLiNER within Presidio

## What is GLiNER

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entity types, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.

Paper: [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)

Since GLiNER takes both the input text and the entity types as input, it can be used for zero-shot named entity recognition, meaning it can recognize entity types that were not seen during training.

## PII Detection with GLiNER

GLiNER has a trained PII detection model: 🔍 [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) *(Apache 2.0)*

This model is capable of recognizing various types of *personally identifiable information* (PII), including but not limited to these entity types: `person`, `organization`, `phone number`, `address`, `passport number`, `email`, `credit card number`, `social security number`, `health insurance id number`, `date of birth`, `mobile phone number`, `bank account number`, `medication`, `cpf`, `driver's license number`, `tax identification number`, `medical condition`, `identity card number`, `national id number`, `ip address`, `email address`, `iban`, `credit card expiration date`, `username`, `health insurance number`, `registration number`, `student id number`, `insurance number`, `flight number`, `landline phone number`, `blood type`, `cvv`, `reservation number`, `digital signature`, `social media handle`, `license plate number`, `cnpj`, `postal code`, `passport_number`, `serial number`, `vehicle registration number`, `credit card brand`, `fax number`, `visa number`, `insurance company`, `identity document number`, `transaction number`, `national health insurance number`, `cvc`, `birth certificate number`, `train ticket number`, `passport expiration date`, and `social_security_number`.
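
For reference, the same model can also be called directly through the `gliner` package, outside of Presidio. The snippet below is a minimal sketch assuming the `gliner` library's `GLiNER.from_pretrained`/`predict_entities` API; the example text, label subset, and threshold are illustrative only and not part of the original sample.

```python
from gliner import GLiNER

# Load the PII-oriented GLiNER model (requires the `gliner` package)
model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")

# Zero-shot NER: the entity types of interest are passed at inference time
labels = ["person", "organization", "email", "phone number"]

# Illustrative text and confidence threshold (tune the threshold for your data)
text = "Contact Jane Doe at jane.doe@example.com or +1-202-555-0101."
entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```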

## Using GLiNER with Presidio

Presidio has a built-in `EntityRecognizer` for GLiNER: `GLiNERRecognizer`. This recognizer can be used to detect PII entities in text using the GLiNER model.

### Installation

To use GLiNER with Presidio, install the `presidio-analyzer` package with the `gliner` extra:

```bash
pip install 'presidio-analyzer[gliner]'
```

!!! note
    GLiNER supports only Python 3.10 and above, while Presidio supports Python 3.9 and above.

### Example

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import GLiNERRecognizer


# Load a small spaCy model, as we don't need spaCy's NER
provider = NlpEngineProvider(
    nlp_configuration={
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }
)
nlp_engine = provider.create_engine()

# Create an analyzer engine that uses the small spaCy model
analyzer_engine = AnalyzerEngine(nlp_engine=nlp_engine)

# Map GLiNER entity labels to Presidio entity names
entity_mapping = {
    "person": "PERSON",
    "name": "PERSON",
    "organization": "ORGANIZATION",
    "location": "LOCATION",
}

# Define and create the GLiNER recognizer
gliner_recognizer = GLiNERRecognizer(
    model_name="urchade/gliner_multi_pii-v1",
    entity_mapping=entity_mapping,
    flat_ner=False,
    multi_label=True,
    map_location="cpu",
)

# Add the GLiNER recognizer to the registry
analyzer_engine.registry.add_recognizer(gliner_recognizer)

# Remove the spaCy recognizer to avoid NER results coming from spaCy as well
analyzer_engine.registry.remove_recognizer("SpacyRecognizer")

# Analyze text
results = analyzer_engine.analyze(
    text="Hello, my name is Rafi Mor, I'm from Binyamina and I work at Microsoft.",
    language="en",
)

print(results)
```
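
As a possible next step, the analyzer results can be handed to Presidio's anonymizer to replace the detected values with placeholders. This is a minimal sketch, assuming `presidio-anonymizer` is installed and reusing the `results` variable from the example above; it is not part of the original sample.

```python
from presidio_anonymizer import AnonymizerEngine

# Replace each detected span with a placeholder such as <PERSON> (the default operator)
anonymizer = AnonymizerEngine()

anonymized = anonymizer.anonymize(
    text="Hello, my name is Rafi Mor, I'm from Binyamina and I work at Microsoft.",
    analyzer_results=results,
)

print(anonymized.text)
```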
85 changes: 43 additions & 42 deletions mkdocs.yml
@@ -74,48 +74,49 @@ nav:
- Presidio Structured Python API: api/structured_python.md
- REST API reference: https://microsoft.github.io/presidio/api-docs/api-docs.html" target="_blank
- Samples:
- Home: samples/index.md
- Text:
- Presidio Basic Usage Notebook: samples/python/presidio_notebook.ipynb
- Customizing Presidio Analyzer: samples/python/customizing_presidio_analyzer.ipynb
- Configuring The NLP engine: samples/python/ner_model_configuration.ipynb
- Encrypting and Decrypting identified entities: samples/python/encrypt_decrypt.ipynb
- Getting the identified entity value using a custom Operator: samples/python/getting_entity_values.ipynb
- Anonymizing known values: samples/python/Anonymizing known values.ipynb
- Keeping some entities from being anonymized: samples/python/keep_entities.ipynb
- Integrating with external services: samples/python/integrating_with_external_services.ipynb
- Remote Recognizer: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py
- Azure AI Language as a Remote Recognizer: samples/python/text_analytics/index.md
- Using Flair as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py
- Using Span Marker as an external PII model: https://github.com/microsoft/presidio/blob/main/docs/samples/python/span_marker_recognizer.py
- Using Transformers as an external PII model: samples/python/transformers_recognizer/index.md
- Using GLiNER as an external PII model: samples/python/gliner.md
- Pseudonymization (replace PII values using mappings): samples/python/pseudonymization.ipynb
- Passing a lambda as a Presidio anonymizer using Faker: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py
- Synthetic data generation with OpenAI: samples/python/synth_data_with_openai.ipynb
- YAML based no-code configuration: samples/python/no_code_config.ipynb
- Data:
- Analyzing structured / semi-structured data in batch: samples/python/batch_processing.ipynb
- Presidio Structured Basic Usage Notebook: samples/python/example_structured.ipynb
- Analyze and Anonymize CSV file: https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py
- Images:
- Redacting Text PII from DICOM images: samples/python/example_dicom_image_redactor.ipynb
- Using an allow list with image redaction: samples/python/image_redaction_allow_list_approach.ipynb
- Plot custom bounding boxes: samples/python/plot_custom_bboxes.ipynb
- Example DICOM redaction evaluation: samples/python/example_dicom_redactor_evaluation.ipynb
- PDF:
- Annotating PII in a PDF: samples/python/example_pdf_annotation.ipynb
- Deployment:
- Presidio with App Service: samples/deployments/app-service/index.md
- Presidio with Kubernetes: samples/deployments/k8s/index.md
- Presidio with Spark: samples/deployments/spark/index.md
- Azure Data Factory:
- ETL using AppService/Databricks: samples/deployments/data-factory/presidio-data-factory.md
- Add Presidio as an HTTP service to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-http.md
- Add Presidio on Databricks to your Azure Data Factory: samples/deployments/data-factory/presidio-data-factory-template-gallery-databricks.md
- PII Masking LLM calls using LiteLLM proxy: samples/docker/litellm.md
- Demo app:
- Create a simple demo app using Streamlit: samples/python/streamlit/index.md
not_in_nav : |
design.md
samples/deployments/index.md
@@ -17,6 +17,7 @@
from .es_nie_recognizer import EsNieRecognizer
from .es_nif_recognizer import EsNifRecognizer
from .fi_personal_identity_code_recognizer import FiPersonalIdentityCodeRecognizer
from .gliner_recognizer import GLiNERRecognizer
from .iban_recognizer import IbanRecognizer
from .in_aadhaar_recognizer import InAadhaarRecognizer
from .in_pan_recognizer import InPanRecognizer
@@ -96,6 +97,7 @@
"ItIdentityCardRecognizer",
"ItPassportRecognizer",
"InPanRecognizer",
"GLiNERRecognizer",
"PlPeselRecognizer",
"AzureAILanguageRecognizer",
"InAadhaarRecognizer",