Trigram Language Model Project

Overview

This project implements a character-level trigram language model. Separate models are trained on English, German, and Spanish datasets, with support for several smoothing techniques and sampling methods. The project covers model training, text generation, perplexity calculation, and optimization of the smoothing parameter.
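
As a rough illustration of the idea, here is a minimal sketch of a character-level trigram model with add-alpha smoothing and a perplexity computation. It is not the code in model.py: the function names (`train_counts`, `prob`, `perplexity`), the `#` boundary padding, the toy text, and the small alpha grid are all assumptions made for this example.

```python
import math
from collections import defaultdict


def train_counts(text):
    """Count character trigrams and their two-character histories."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    padded = "##" + text + "#"  # '#' marks sequence boundaries (assumed convention)
    for i in range(len(padded) - 2):
        tri[padded[i:i + 3]] += 1
        bi[padded[i:i + 2]] += 1  # history count for the trigram starting at i
    return tri, bi


def prob(tri, bi, vocab, history, char, alpha=0.1):
    """Add-alpha estimate of P(char | history)."""
    numerator = tri.get(history + char, 0) + alpha
    denominator = bi.get(history, 0) + alpha * len(vocab)
    return numerator / denominator


def perplexity(tri, bi, vocab, text, alpha=0.1):
    """Perplexity of `text` under the smoothed trigram model (log base 2)."""
    padded = "##" + text + "#"
    log_sum = 0.0
    n = 0
    for i in range(2, len(padded)):
        log_sum += math.log2(prob(tri, bi, vocab, padded[i - 2:i], padded[i], alpha))
        n += 1
    return 2 ** (-log_sum / n)


# Toy usage (hypothetical data, not the project's training files)
train_text = "the cat sat on the mat"
vocab = sorted(set(train_text) | {"#"})
tri, bi = train_counts(train_text)
print(perplexity(tri, bi, vocab, "the cat", alpha=0.1))

# A simple grid search over alpha, in the spirit of optimize_a
alphas = [0.001, 0.01, 0.1, 1.0]
best_alpha = min(alphas, key=lambda a: perplexity(tri, bi, vocab, "the hat", a))
print(best_alpha)
```

In practice the alpha search would be run on a held-out validation split rather than on a hand-picked string, which is what the dataset-splitting helpers in utils.py are for.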

File Structure

model.py: Defines the Trigram class, handling model initialization, training, text generation, and perplexity calculation.
main.py: Runs the complete pipeline, including model training, text generation with various sampling methods, testing, and optimization.
optimize.py: Provides the optimize_a function to find the best smoothing parameter a that minimizes perplexity.
smoothing.py: Implements Good-Turing smoothing to adjust trigram counts for unseen events.
sampling.py: Offers different text sampling methods, including maximum likelihood, top-k, top-p, and weighted random generation.
utils.py: Contains helper functions for preprocessing text, reading data files, and splitting datasets for training and validation.
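
To make the sampling methods listed for sampling.py more concrete, the following sketch shows greedy (maximum likelihood), top-k, and top-p selection over a next-character distribution. The function names, the dictionary representation of the distribution, and the example probabilities are assumptions for illustration and may differ from what sampling.py actually does.

```python
import random


def greedy_sample(dist):
    """Maximum likelihood choice: always pick the most probable character."""
    return max(dist, key=dist.get)


def top_k_sample(dist, k, rng=random):
    """Sample from the k most probable characters, weighted by probability."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    chars, weights = zip(*top)
    return rng.choices(chars, weights=weights, k=1)[0]


def top_p_sample(dist, p, rng=random):
    """Sample from the smallest set of top characters whose total mass reaches p."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for char, prob in ranked:
        nucleus.append((char, prob))
        cumulative += prob
        if cumulative >= p:
            break
    chars, weights = zip(*nucleus)
    return rng.choices(chars, weights=weights, k=1)[0]


# Hypothetical next-character distribution for some two-character history
dist = {"a": 0.5, "b": 0.3, "c": 0.2}
print(greedy_sample(dist))
print(top_k_sample(dist, k=2))
print(top_p_sample(dist, p=0.7))
```

`random.choices` accepts unnormalised weights, so the truncated distributions do not need to be renormalised by hand. Weighted random generation corresponds to sampling from the full distribution, for example `top_k_sample(dist, k=len(dist))`.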

Requirements

Python 3.9 (running the project in PyCharm is recommended)

Folder Setup

To get started, ensure that the following folder structure is in place:

data/: Place the training datasets (training.de, training.en, training.es) in this folder.
model/: Place the initial model file model-br.en here. Trained models will also be saved in this folder.
output/: This folder should be empty initially. Generated output files, such as perplexity optimization plots, will be saved here.
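
The layout above can be checked before running main.py with a small helper like the one below. This is an optional sketch and not part of the repository: `check_layout` is a hypothetical name, and the script only creates missing folders and reports missing files, it does not download or generate any data.

```python
from pathlib import Path

# File names taken from the folder description above
EXPECTED_FILES = [
    "data/training.de",
    "data/training.en",
    "data/training.es",
    "model/model-br.en",
]


def check_layout(root="."):
    """Create data/, model/, and output/ if needed and return the missing files."""
    root = Path(root)
    for folder in ("data", "model", "output"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = check_layout()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Folder setup looks complete.")
```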

Running Extra Questions

To run the extra question (Q6), make sure the test-port file is present in the data/ folder. The content for this file is provided at the end of the Appendix.

How to Run

Copy the provided code into the appropriate files (main.py, model.py, optimize.py, etc.).
Ensure the data/, model/, and output/ folders are correctly set up as described above.
To get the results for Q1-Q5, run main.py. This trains the models, performs the optimization, and outputs the results.
```
python main.py
```
To run the extra question, Good-Turing smoothing, or other sampling methods, uncomment the relevant code in main.py before running.
