Documatch

A Python program that scores how relevant a text document or query is to a given dataset of documents, using TF-IDF vectors and cosine similarity.

Dataset

The corpus consists of all the general election presidential debates from 1960 to 2012. The processed corpus is provided as a .zip file containing 30 .txt files. Each file holds the transcript of one debate and is named after the date of that debate.
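As a rough sketch of how the unzipped transcripts could be read into memory: the function name `load_corpus`, the UTF-8 encoding, and the lowercasing step are assumptions made for illustration, not details taken from main.py.

```python
import os


def load_corpus(corpus_dir):
    """Read every .txt file in corpus_dir into a {filename: text} dict.

    Assumptions (not confirmed by the repository): files are UTF-8
    encoded and text is lowercased before any further processing.
    """
    docs = {}
    for filename in sorted(os.listdir(corpus_dir)):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a transcript
        path = os.path.join(corpus_dir, filename)
        with open(path, "r", encoding="utf-8") as f:
            docs[filename] = f.read().lower()
    return docs
```

Calling `load_corpus` on the directory that holds the unzipped files would return one entry per debate, keyed by file name.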

Programming Language

-> Python 3.5.1
-> NLTK library (http://www.nltk.org/index.html)

Test Cases

-> query(qstring): return the document that has the highest similarity score with respect to 'qstring'.
-> getcount(token): return the total number of occurrences of a token across all documents.
-> getidf(token): return the inverse document frequency of a token. If the token doesn't exist in the corpus, return 0.
-> docdocsim(filename1,filename2): return the cosine similarity between two speeches (files).
-> querydocsim(qstring,filename): return the cosine similarity between a query string and a document.
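The functions listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in main.py: the toy three-document corpus, the regex tokenizer, the weighting scheme (log-scaled term frequency times log10(N/df)), and the helper names `tokenize`, `tfidf_vector`, and `cosine` are all assumptions. The actual project uses NLTK and may tokenize, stem, or weight terms differently.

```python
import math
import re
from collections import Counter

# Toy stand-in for the debate transcripts (hypothetical file names and text).
DOCS = {
    "doc1.txt": "health care and taxes",
    "doc2.txt": "taxes and jobs",
    "doc3.txt": "foreign policy and security",
}


def tokenize(text):
    # Assumption: lowercase alphabetic tokens, no stemming or stop-word removal.
    return re.findall(r"[a-z]+", text.lower())


TOKENS = {name: tokenize(text) for name, text in DOCS.items()}


def getcount(token):
    """Total number of occurrences of token across all documents."""
    return sum(tokens.count(token) for tokens in TOKENS.values())


def getidf(token):
    """Inverse document frequency, log10(N / df); 0 if the token is unseen."""
    df = sum(1 for tokens in TOKENS.values() if token in tokens)
    return math.log10(len(TOKENS) / df) if df else 0


def tfidf_vector(tokens):
    """Unit-length TF-IDF vector as a {token: weight} dict."""
    counts = Counter(tokens)
    weights = {t: (1 + math.log10(c)) * getidf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(v1, v2):
    # Both vectors are already normalized, so the dot product is the cosine.
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())


def docdocsim(filename1, filename2):
    return cosine(tfidf_vector(TOKENS[filename1]), tfidf_vector(TOKENS[filename2]))


def querydocsim(qstring, filename):
    return cosine(tfidf_vector(tokenize(qstring)), tfidf_vector(TOKENS[filename]))


def query(qstring):
    """Name of the document most similar to qstring."""
    return max(TOKENS, key=lambda name: querydocsim(qstring, name))
```

In this toy corpus, "and" appears in every document, so its IDF is 0 and it contributes nothing to any similarity score. That is the behavior TF-IDF is meant to give for very common words.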
