LinTO-diarization is the LinTO service for speaker diarization, with some ability to guess the number of speakers and identify some speakers if samples of their voice are provided.
LinTO-diarization can either be used as a standalone diarization service or deployed as a micro-service.
The diarization service requires Docker up and running.
In task mode, the only entry point of the diarization service is tasks posted on a Redis message broker. Furthermore, to prevent large audio files from transiting through the message broker, diarization uses a shared storage folder mounted on /opt/audio.
linto-diarization can be deployed:
- As a standalone diarization service through an HTTP API.
- As a micro-service connected to a message broker.
1- First step is to build or pull the image:
git clone https://github.com/linto-ai/linto-diarization.git
cd linto-diarization
docker build . -t linto-diarization-pyannote:latest -f pyannote/Dockerfile
or
docker pull lintoai/linto-diarization-pyannote
For speaker identification, run Qdrant (its default port is 6333):
docker run \
-p 6333:6333 \
-v ./qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
1- Fill the .env
An example of .env file is provided in pyannote/.envdefault.
Parameters:
Variables | Description | Example |
---|---|---|
SERVING_MODE | (Required) Specify launch mode | http |
CONCURRENCY | Number of worker(s) additional to the main worker | 0 \| 1 \| 2 \| ... |
DEVICE | Device to use for the embedding model (by default, GPU/CUDA is used if it is available, CPU otherwise) | cpu \| cuda \| cuda:0 |
DEVICE_CLUSTERING | Device to use for clustering (same as DEVICE by default) | cpu \| cuda \| cuda:0 |
DEVICE_IDENTIFICATION | Device to use for speaker identification, if it is enabled (same as DEVICE by default) | cpu \| cuda \| cuda:0 |
NUM_THREADS | Maximum number of threads to use for things running on CPU | 1 \| 4 \| ... |
CUDA_VISIBLE_DEVICES | GPU device index to use, when running on GPU/CUDA. We also recommend setting CUDA_DEVICE_ORDER=PCI_BUS_ID on multi-GPU machines | 0 \| 1 \| 2 \| ... |
SPEAKER_SAMPLES_FOLDER | (default: /opt/speaker_samples) Folder where to find audio files for target speaker samples | /path/to/folder |
SPEAKER_PRECOMPUTED_FOLDER | (default: /opt/speaker_precomputed) Folder where to store precomputed embeddings of target speakers | /path/to/folder |
QDRANT_HOST | Host address of the Qdrant instance | localhost |
QDRANT_PORT | Port number for the Qdrant instance | 6333 |
QDRANT_COLLECTION | Name of the collection in Qdrant for storing embeddings | speaker_embeddings |
QDRANT_RECREATE_COLLECTION | Recreate collection or use existing one from mounted volume | true |
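The device defaults described in the table can be sketched in Python as follows. This is only an illustration of the documented fallback rules, not the service's actual code: the function name resolve_devices and the cuda_available flag are hypothetical, and treating an empty variable as unset is an assumption.

```python
def resolve_devices(env, cuda_available):
    """Resolve the three device settings following the documented defaults.

    env: mapping of environment variables (for instance os.environ).
    cuda_available: whether a GPU/CUDA device can be used (hypothetical flag).
    """
    # DEVICE defaults to GPU/CUDA when available, CPU otherwise.
    device = env.get("DEVICE") or ("cuda" if cuda_available else "cpu")
    return {
        "embedding": device,
        # Both of these fall back to DEVICE when they are not set.
        "clustering": env.get("DEVICE_CLUSTERING") or device,
        "identification": env.get("DEVICE_IDENTIFICATION") or device,
    }


print(resolve_devices({"DEVICE": "cuda:0", "DEVICE_CLUSTERING": "cpu"}, cuda_available=True))
```

With the values above, only clustering runs on CPU, while the embedding model and speaker identification share cuda:0.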
2- Run the container
This runs a container providing an HTTP API bound to the host port <HOST_SERVING_PORT> (for instance 8080):
docker run --rm \
-v <SHARED_FOLDER>:/opt/audio \
-p <HOST_SERVING_PORT>:80 \
--env-file .env \
linto-diarization-pyannote:latest
If you want to enable speaker identification, you have to provide samples of the target speakers' voices, either in separate folders with the name of the speaker as the folder name, or in separate files with the name of the speaker as the file name.
Then the parent folder of the samples must be mounted as a volume in the container under /opt/speaker_samples (or a custom folder set with the SPEAKER_SAMPLES_FOLDER environment variable).
docker run ... -v </path/to/speaker/samples/folder>:/opt/speaker_samples
When using speaker identification, if you want to reuse an existing collection in the volume mounted to the Qdrant container, set the environment variable QDRANT_RECREATE_COLLECTION=false.
This avoids the initialisation time at each new docker run.
You may also want to add --gpus all to enable GPU capabilities, and set CUDA_VISIBLE_DEVICES if several GPU cards are available.
LinTO-diarization can be deployed as a micro-service using Celery. Used this way, the container spawns a Celery worker that waits for diarization tasks on a message broker.
You need a message broker up and running at SERVICES_BROKER.
1- Fill the .env
An example of .env file is provided in pyannote/.envdefault.
Parameters are the same as for the HTTP API, with the addition of the following:
Variables | Description | Example |
---|---|---|
SERVING_MODE | (Required) Specify launch mode | task |
SERVICES_BROKER | Service broker URI | redis://my_redis_broker:6379 |
BROKER_PASS | Service broker password (leave empty if there is no password) | my_password |
QUEUE_NAME | Override the generated queue's name (see Queue name below) | my_queue |
SERVICE_NAME | Service's name | diarization-ml |
LANGUAGE | Language code as a BCP-47 code | en-US or * or languages separated by "\|" |
MODEL_INFO | Human readable description of the model | Multilingual diarization model |
QDRANT_HOST | Host address of the Qdrant instance | localhost |
QDRANT_PORT | Port number for the Qdrant instance | 6333 |
QDRANT_COLLECTION | Name of the collection in Qdrant for storing embeddings | speaker_embeddings |
QDRANT_RECREATE_COLLECTION | Recreate collection or use existing one from mounted volume | true |
2- Fill the docker-compose.yml
#docker-compose.yml
version: '3.7'
services:
  diarization-service:
    image: linto-diarization-pyannote:latest
    volumes:
      - /path/to/shared/folder:/opt/audio
    env_file: .env
    deploy:
      replicas: 1
    networks:
      - your-net
networks:
  your-net:
    external: true
3- Deploy the compose file as a stack (Docker Swarm)
docker stack deploy --resolve-image always --compose-file docker-compose.yml your_stack
Queue name:
By default, the service queue name is generated from SERVICE_NAME.
The queue name can be overridden using the QUEUE_NAME environment variable.
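The queue naming rule can be sketched in Python as follows. This is a minimal illustration rather than the service's actual code: the function name resolve_queue_name is hypothetical, and treating an empty QUEUE_NAME as unset is an assumption.

```python
import os


def resolve_queue_name(env=None):
    """Return QUEUE_NAME when it is set and non-empty, else SERVICE_NAME."""
    env = os.environ if env is None else env
    # Assumption: an empty QUEUE_NAME is handled like an unset one.
    return env.get("QUEUE_NAME") or env["SERVICE_NAME"]


print(resolve_queue_name({"SERVICE_NAME": "diarization-ml"}))
print(resolve_queue_name({"SERVICE_NAME": "diarization-ml", "QUEUE_NAME": "my_queue"}))
```

A client posting tasks has to target the same queue name, so it is convenient to compute it from the same two variables on both sides.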
Service discovery:
As a micro-service, the instance registers itself in the service registry for discovery. The service information is stored as a JSON object in Redis db0 under the id service:{HOST_NAME}.
The following information is registered:
{
  "service_name": $SERVICE_NAME,
  "host_name": $HOST_NAME,
  "service_type": "diarization",
  "service_language": $LANGUAGE,
  "queue_name": $QUEUE_NAME,
  "version": "1.2.0", # This repository's version
  "info": $MODEL_INFO,
  "last_alive": 65478213,
  "concurrency": 1
}
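On the client side, a registry entry like the one above can be decoded to find out which queue to post tasks to. The sketch below only covers the key naming and the JSON decoding; how the entry is fetched from Redis is left out. The function names are hypothetical, and the sample values are taken from the examples in this document, with my-host as a made-up host name.

```python
import json


def registry_key(host_name):
    """Key under which a service instance registers itself."""
    return f"service:{host_name}"


def parse_service_record(raw):
    """Decode a registry entry and keep the fields needed to route a task."""
    record = json.loads(raw)
    return {
        "service_name": record["service_name"],
        "service_type": record["service_type"],
        "language": record["service_language"],
        "queue_name": record["queue_name"],
    }


# Hypothetical entry, with the placeholders of the listing filled in.
raw = json.dumps({
    "service_name": "diarization-ml",
    "host_name": "my-host",
    "service_type": "diarization",
    "service_language": "*",
    "queue_name": "diarization-ml",
    "version": "1.2.0",
    "info": "Multilingual diarization model",
    "last_alive": 65478213,
    "concurrency": 1,
})
print(registry_key("my-host"), parse_service_record(raw))
```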
Returns the state of the API
Method: GET
Returns "1" if healthcheck passes.
Diarization API
Input arguments are:
- file: A Wave file
- speaker_count: (integer - optional) Number of speakers. If empty, the number of speakers is determined automatically by clustering.
- max_speaker: (integer - optional) Max number of speakers if speaker_count is unknown.
- speaker_names: (string - optional) List of target speaker names for speaker identification (only if speaker samples are provided). Possible values are:
  - empty string "": no speaker identification
  - wild card "*": speaker identification for all speakers
  - list of speaker names in json format (ex: "["speaker1", ..., "speakerN"]") or separated by | (ex: "speaker1|...|speakerN"): speaker identification for the listed speakers only
The response (application/json) is a JSON object structured as follows:
{
"speakers": [
{"spk_id": "spk5", "duration": 2.2, "nbr_seg": 1},
...
],
"segments": [
{"seg_id": 1, "spk_id": "spk5", "seg_begin": 0.0, "seg_end": 2.2},
...
]
}
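To make the relation between the two parts of the response concrete, the sketch below recomputes a per-speaker summary from the segments list. It assumes that a speaker's duration is the sum of the lengths of its segments, which matches the example above but is not stated by the service; the function name summarize_segments is hypothetical.

```python
from collections import defaultdict


def summarize_segments(segments):
    """Recompute per-speaker duration and segment count from the segments list."""
    totals = defaultdict(lambda: {"duration": 0.0, "nbr_seg": 0})
    for seg in segments:
        entry = totals[seg["spk_id"]]
        # Assumption: speaker duration is the sum of its segment lengths.
        entry["duration"] += seg["seg_end"] - seg["seg_begin"]
        entry["nbr_seg"] += 1
    return [
        {"spk_id": spk, "duration": round(value["duration"], 3), "nbr_seg": value["nbr_seg"]}
        for spk, value in totals.items()
    ]


response = {
    "speakers": [{"spk_id": "spk5", "duration": 2.2, "nbr_seg": 1}],
    "segments": [{"seg_id": 1, "spk_id": "spk5", "seg_begin": 0.0, "seg_end": 2.2}],
}
print(summarize_segments(response["segments"]))
```

Such a summary can serve as a quick sanity check on a response, or as a starting point for computing speaking-time statistics.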
The /docs route offers an OpenAPI/Swagger interface.
Diarization worker accepts requests with the following arguments:
- file: (str) Relative path of the file in the shared folder.
- speaker_count: (int, default None) Fixed number of speakers.
- max_speaker: (int, default None) Max number of speakers if speaker_count=None.
- speaker_names: (string, default None) List of target speaker names for speaker identification (only if speaker samples are provided). Possible values are:
  - empty string "": no speaker identification
  - wild card "*": speaker identification for all speakers
  - list of speaker names in json format (ex: "["speaker1", ..., "speakerN"]") or separated by | (ex: "speaker1|...|speakerN"): speaker identification for the listed speakers only
On a successful diarization, the returned object is a JSON object structured as follows:
{
"speakers": [
{"spk_id": "spk5", "duration": 2.0, "nbr_seg": 1},
...
],
"segments": [
{"seg_id": 1, "spk_id": "spk5", "seg_begin": 0.0, "seg_end": 2.0},
...
]
}
- The speakers field contains an array of speakers, each with its overall duration and number of segments.
- The segments field contains each audio segment with the associated speaker id, start time and end time.
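The worker arguments can be assembled with a small helper such as the one below. It is a sketch under stated assumptions: the function names are hypothetical, /opt/audio is the shared folder mount point mentioned earlier, and sending the task to the broker is left out because it depends on the client setup.

```python
import json
from pathlib import PurePosixPath


def encode_speaker_names(names=None, all_speakers=False, as_json=True):
    """Build the speaker_names value in one of the documented formats."""
    if all_speakers:
        return "*"  # identification for all known speakers
    if not names:
        return ""  # no speaker identification
    if as_json:
        return json.dumps(list(names))
    if any("|" in name for name in names):
        raise ValueError("speaker names must not contain '|'")
    return "|".join(names)


def build_task_args(path_in_container, shared_root="/opt/audio",
                    speaker_count=None, max_speaker=None, speaker_names=None):
    """Return the task arguments, with `file` relative to the shared folder."""
    # relative_to raises ValueError when the file is outside the shared folder.
    relative = PurePosixPath(path_in_container).relative_to(shared_root)
    return {
        "file": str(relative),
        "speaker_count": speaker_count,
        "max_speaker": max_speaker,
        "speaker_names": speaker_names,
    }


print(build_task_args("/opt/audio/batch1/meeting.wav", max_speaker=4,
                      speaker_names=encode_speaker_names(["speaker1", "speaker2"])))
```

Converting the path up front catches files that were never copied to the shared folder before a task is posted.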
You can test your HTTP API using curl:
curl -X POST "http://YOUR_SERVICE:PORT/diarization" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@YOUR_FILE.wav;type=audio/x-wav" -F "speaker_count=NUMBER_OF_SPEAKERS"
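The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command above (a multipart form with a file part and an optional speaker_count field, posted to /diarization). The function names are hypothetical and error handling is omitted.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: audio/x-wav",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def diarize(base_url, wav_path, speaker_count=None):
    """POST a Wave file to the /diarization route and return the decoded JSON."""
    fields = {} if speaker_count is None else {"speaker_count": speaker_count}
    with open(wav_path, "rb") as wav_file:
        body, content_type = build_multipart(
            fields, os.path.basename(wav_path), wav_file.read()
        )
    request = urllib.request.Request(
        f"{base_url}/diarization",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call, assuming the service listens on localhost:8080:
# result = diarize("http://localhost:8080", "YOUR_FILE.wav", speaker_count=2)
```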
This project is developed under the AGPLv3 License (see LICENSE).
- PyAnnote diarization framework (License MIT).