feat(scoring): post semantic scorer #69

teslashibe · 2025-01-04T16:51:20Z

Problem Statement

In Masa Agent Arena, agents are exhibiting low-quality conversational behaviors:

Content Repetition: Agents frequently generate near-duplicate responses across different conversations
Lack of Originality: Multiple agents produce very similar responses to the same prompts
Quality Impact: This repetitive behavior reduces the value of agent interactions and limits learning opportunities

The SemanticScorer was implemented to:

Quantify response quality through originality (60%) and uniqueness (40%) metrics
Penalize repetitive content by detecting semantic similarities
Encourage diverse and original responses by providing feedback scores
Help identify and improve underperforming agents

This scoring mechanism serves as a quality gate to drive improvement in agent response diversity and originality.

Proposed Solution

This code evaluates the quality of AI-generated posts by scoring them on two key metrics:

Originality Score (60% weight)
- Measures how different a post is from all other posts
- Higher score = less similar to other posts on average
- Calculation: 1 - average_similarity_to_other_posts
Uniqueness Score (40% weight)
- Measures how many very similar posts exist
- Higher score = fewer near-duplicate posts
- Uses a similarity threshold (0.8) to identify "similar" posts
- Calculation: 1 / (1 + number_of_similar_posts)
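
The two formulas and the 60/40 weighting can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual SemanticScorer code: the function name `score_posts` is hypothetical, and it assumes a post is never compared with itself, that "similar" means strictly above the 0.8 threshold, and that a lone post counts as fully original.

```python
from typing import List

SIMILARITY_THRESHOLD = 0.8  # threshold quoted in the issue
ORIGINALITY_WEIGHT = 0.6    # weights quoted in the issue
UNIQUENESS_WEIGHT = 0.4


def score_posts(sim: List[List[float]]) -> List[float]:
    """Return one combined score per post.

    `sim` is a square matrix where sim[i][j] is the similarity
    between post i and post j.
    """
    n = len(sim)
    scores = []
    for i in range(n):
        # Assumption: exclude the post's similarity to itself.
        others = [sim[i][j] for j in range(n) if j != i]
        if not others:
            # Assumption: a single post has nothing to be similar to.
            scores.append(1.0)
            continue
        # Originality: 1 - average similarity to other posts.
        originality = 1.0 - sum(others) / len(others)
        # Uniqueness: 1 / (1 + number of similar posts).
        # Assumption: "similar" means strictly above the threshold.
        similar_count = sum(1 for s in others if s > SIMILARITY_THRESHOLD)
        uniqueness = 1.0 / (1.0 + similar_count)
        scores.append(
            ORIGINALITY_WEIGHT * originality + UNIQUENESS_WEIGHT * uniqueness
        )
    return scores


# Made-up similarity values: posts 0 and 1 are near-duplicates, post 2 is unrelated.
example_sim = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
example_scores = score_posts(example_sim)
print([round(s, 3) for s in example_scores])
```

With these made-up similarities, the two near-duplicates each score roughly 0.485 (originality 0.475, uniqueness 0.5), while the unrelated post scores roughly 0.94 (originality 0.9, uniqueness 1.0).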

Process Flow

Takes a list of text posts
Converts texts into embeddings using SentenceTransformer
Creates similarity matrix comparing each post against every other post
Calculates both scores
Combines scores using weights (60% originality, 40% uniqueness)
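
The first three steps (posts, embeddings, similarity matrix) might look like the sketch below. It assumes cosine similarity, which the issue does not name explicitly, and all function names here are hypothetical. The embedding step is passed in as a callable so the matrix logic can be exercised without downloading a model; `minilm_embed` shows how the `sentence-transformers` package would plug in.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_matrix(
    posts: List[str],
    embed: Callable[[List[str]], Sequence[Vector]],
) -> List[List[float]]:
    """Embed every post, then compare each post against every other post."""
    embeddings = embed(posts)
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


def minilm_embed(posts: List[str]) -> List[List[float]]:
    """Embed posts with all-MiniLM-L6-v2 (384-dimensional vectors).

    Imported lazily so the rest of this sketch runs without the package.
    A real scorer would load the model once rather than on every call.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(posts).tolist()


# Toy embedder standing in for the model: two identical vectors, one orthogonal.
def toy_embed(posts: List[str]) -> List[List[float]]:
    return [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]


toy_matrix = build_similarity_matrix(["a", "b", "c"], toy_embed)
```

Calling `build_similarity_matrix(posts, minilm_embed)` would produce the real matrix; the originality and uniqueness scores are then computed row by row from it.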

Model Details

Model: all-MiniLM-L6-v2
Architecture: MiniLM (Knowledge Distilled from BERT)
Size: ~80 MB (vs. ~440 MB for BERT-base)
Performance:
- Faster than BERT
- 384 dimension embeddings
- Good balance of speed/performance for semantic similarity tasks

Alternatives

If we need higher accuracy (at the cost of speed), we could use:

all-mpnet-base-v2 (more accurate but slower)
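bert-base-nli-mean-tokens (original BERT; an older model marked as deprecated on its model card, so verify its accuracy before adopting it)
roberta-large-nli-stsb-mean-tokens (RoBERTa; also an older model marked as deprecated, same caveat)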

Current model choice (MiniLM) is optimized for:

Fast inference speed
Lower memory usage
Good enough accuracy for similarity scoring

Example

scorer = SemanticScorer()
posts = [
    "The weather is nice today",
    "The weather is nice today!",  # Very similar to #1
    "AI is transforming technology",  # Completely different
]

scores = scorer.calculate_scores(posts)
# Post 1 & 2 will get lower scores (similar content)
# Post 3 will get higher score (unique content)

Use Case

This scorer helps identify:

Repetitive content
Near-duplicate posts
Diverse and original content

It's particularly useful for:

Quality control of AI outputs
Detecting when an AI is being too repetitive
Encouraging diverse responses

The final score (0-1) indicates how original and unique each post is within the given set, with higher scores being better.


Luka-Loncar · 2025-01-16T17:20:02Z

Does this one still stand, given that you posted a PR after it? @teslashibe

Not sure if this PR is related: #96

teslashibe self-assigned this Jan 4, 2025

teslashibe added the enhancement New feature or request label Jan 4, 2025
