TopicModels

This repository contains code for NLP topic models like CORRLDA2, ECTM. These topic models are applied on hindi news articles collected frm Amar Ujala website. A brief overview of the work flow is shown below:

DataCollection

This repository contains java code to crawl web pages from "Amar Ujala" newspaper website. The code uses "Crawler4j" (3rd party library) and "Jsoup" library to parse "Hindi" text articles from HTML document. The text files contains news articles which are output of the crawler code and csv files contains web pages which has crawled. Each text file contains news articles based on the classes.

Data Pre-processing

This contains scripts to generate input files required by the ECTM system and the SVM classifier. This involves separating each article into constituent named entities and regular terms (referred to as Non-Entities).

Used Polyglot to extract Named Entities from each article (https://github.com/aboSamoor/polyglot)
Generated vocab-entity, vocab-non-entity files: vocabularies for all words found in dataset
Generated term-index-entity, term-index-non-entity: replaced each article's content by solely the entity or non-entity indices found in vocabulary. These input files needed for ECTM are stored under DataPreProcessing/processedData/ectm.
The features used as input for Vanilla SVM are the counts of the words(Entities + regular words('Non-entities') under each document.The feature files are stored under DataPreProcessing/processedData/svm.

ECTM

Once we get data divided as entity-vocab, non-entity-vocab, entity-term-index and non-entity-term-index we run ECTM model on top of it. These input files are present in DataPreprocessing/processedData/ECTM. The output file from ECTM are stored under ECTM/Output.

Feature extraction

ECTM

The files ECTM_theta, ECTM_psi are used to extract entity topic and word topic related features for each document. These features are stored under FeatureExtraction.

Quantitative Evaluation

Classification

SVM

This folder has python code to learn Support Vector Machine (SVM) using ECTM features. The csv files has ECTM features. This code uses sklearn and pandas inbuilt libraries in python.

Naive Bayes

This folder has python code for our baseline classification model - Naive Bayes. The features used are the words of the documents, excluding stop words.

Qualitative Evaluation

ECTM gives us the ability to organise our news articles into a distribution over "Entity-topics", which themselves are a distributions over Word-Topics (Non-entity words). This allows us to represent a news article by the smaller defined number of entity-topic and word-topic features instead of the large number of words used to normally represent an article. These results can be found under ECTM/Output -

Supertopic-topics - distribution of Entity-topics over the Word-topics
Top-topic entities - top K entities under each Entity-topic
Top-topic words - top K words under each Word-topic

Some results which clearly show the ability of ECTM to cluster entities and entity-centered word-topics.

Some entity-topics which can be found under ECTM/Output/V2 include:

"Banks" ~ Business
बैंक एसबीआई लोन ट्रांज़ैक्शन देशभर आईसीआईसीआई-बैंक पंजाब-नेशनल-बैंक एक्सिस ईमआई रिजरव
baink esbiaaee lon traanzaikshn deshbhr aaeesiaaeesiaaee-baink pnjaab-neshnl-baink eksis eaaee rijrv
bank SBI loan transaction country-wide ICICI-bank Punjab-National-Bank AXIS EMI Reserve
"Businessmen" ~ Business
टाटा गांधी सायरस-मिस्तरी टीसीएस टाटा-संस इंफोसिस रतन-टाटा मिस्तरी टाटा- ग्रुप कालिदास
taataa gaaandhi saayrs-mistri tisies taataa-sns infosis rtn-taataa mistri taataa- garup kaalidaas
TATA Gandhi Cyrus-Mistry TCS TATA-sons Infosys Ratan-Tata Mistri Tata-Group Kaalidas
"God" ~ Spirituality
राम हनुमान महात्मा लक्ष्मण मिशेल तुलसीदास सीता विष्णु लंका अकबर
raam hnumaan mhaatmaa lksmn mishel tulsidaas sitaa visnu lnkaa akbr
Ram Hanuman Mahatma Lakshman Michelle Tulsidas Sita Vishnu TCS Akbar
"Political Leaders"
भारत मोदी प्रधानमंत्री-नरेंद्र-मोदी नरेंद्र-मोदी फोर्बेस भारत-सरकार जेट पीएम-मोदी आचार्य-बालकृष्ण ट्विटर
bhaart modi prdhaanmntri-nrendr-modi nrendr-modi forbes bhaart-srkaar jet piem-modi aachaary-baalkrisn tvitr
Bharat Modi Prime-Minister-Narendra-Modi Narendra-Modi Forbes Bharat-Sarkaar Jet PM-Modi Aacharya-Balakrishna Twitter
"Telecom Companies" ~ Business/Technology
रिलायंस एयरटेल वोडाफोन टेलीकॉम टराई बीएसएनएल भारत अनुष्का मुकेश-अंबानी
rilaayns eyrtel vodaafon telikom traaee biesenel bhaart anuskaa mukesh-anbaani
Reliance Airtel Vodafone Telecom TRAI BSNL India Anushka Mukesh-Ambani
"Social Media" ~ Technology
गूगल भारत फेसबुक याहू यूट्यूब पेपैल अमरीका माइक्रोसॉफ्ट देवदूत विधाता
gaugal bhaart fesbuk yaahu yutyub pail amrikaa maaikrosoft devdut vidhaataa
Google India Facebook Yahoo Youtube Paypal America Microsoft Angel God

We can also see a clear relation between Entity-topics and Word-topics. For example,

E[0] = "Banks"
Entities:

बैंक एसबीआई लोन ट्रांज़ैक्शन देशभर आईसीआईसीआई-बैंक पंजाब-नेशनल-बैंक एक्सिस ईमआई रिजरव
baink esbiaaee lon traanzaikshn deshbhr aaeesiaaeesiaaee-baink pnjaab-neshnl-baink eksis eaaee rijrv
bank SBI loan transaction country-wide ICICI-bank Punjab-National-Bank AXIS EMI Reserve

The Top Three word-topics for this are:

फीसदी बैंक करोड़ दर ब्यॉज रूपये कार्ड कम रुपए लाख
fisdi baink krode dr byoj rupye kaard km rupe laakh
percent bank crore rate interest rupee xard Less rupee lakh
बैंकों जारी अनुसार जवाब मामले बैंक दरों पास दिन खिलाफ
bainkon jaari anusaar jvaab maamle baink dron paas din khilaaf
banks released accordingly answer issue bank rates pass din against
सरकार साल शामिल मन्त्रालय कदम योजना क्षेत्र समस्या आम तहत
srkaar saal shaamil mntraaly kdm yojnaa ksetr smsyaa aam tht
governement years inclusion ministry steps plan area problem common under

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TopicModels

DataCollection

Data Pre-processing

ECTM

Feature extraction

ECTM

Quantitative Evaluation

Classification

SVM

Naive Bayes

Qualitative Evaluation

About

Releases

Packages

Contributors 4

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 159 Commits
CORRLDA2		CORRLDA2
DataCollection		DataCollection
DataPreprocessing		DataPreprocessing
ECTM		ECTM
Evaluation		Evaluation
FeatureExtraction		FeatureExtraction
LDA		LDA
README.md		README.md
Untitled Diagram.xml		Untitled Diagram.xml
Workflow.jpg		Workflow.jpg

singhya/TopicModels

Folders and files

Latest commit

History

Repository files navigation

TopicModels

DataCollection

Data Pre-processing

ECTM

Feature extraction

ECTM

Quantitative Evaluation

Classification

SVM

Naive Bayes

Qualitative Evaluation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages