forked from jourlin/FELTS

FELTS is a Fast Extractor for Large Term Sets - successfully tested with over 5 million distinct multiword terms composed of over 2 million distinct words


alemol/FELTS

 
 


FELTS

FELTS is a Fast Extractor for Large Term Sets. It was successfully tested with over 9.5 million distinct multiword terms composed of over 4.5 million distinct words (Wikipedia article titles in French, English and Spanish). In this particular task:

  • it extracts from any text all occurrences of French, English or Spanish Wikipedia entries
  • it only requires 500 MB of RAM
  • it can process ten million words in less than an hour

USE:

  • create a dictionary file containing a sorted list of multiword terms (one term per line, one space between words).
  • set the DICT variable in the makefile to your dictionary file
  • build the hash function:

make mph

  • start a server, e.g.:

bin/felts_server -p 11111 -d sample.dic -f sample.mph

  • extract terms, e.g.:

cat text_in.txt | sed 's/[[:space:]][[:space:]]*/ /g' | sed 's/^[[:space:]]//' | bin/felts_client localhost 11111 | sed '/^$/d' > terms_out.txt

WARNING: input text must be UTF-8, lower case and free of punctuation, and words must be separated by a single space (hence the sequence of sed filters applied before the text is sent to felts_client).
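The sed filters in the extraction command only normalise whitespace, so lower-casing and punctuation removal are left to the user. The function below is a minimal sketch of a fuller pre-filter. It is not part of FELTS: the name `felts_normalize` is hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption about what suits your dictionary. Whether accented capitals are lower-cased may depend on your awk implementation and locale.

```shell
# Hypothetical helper, not shipped with FELTS.
# Turns arbitrary text into: lower case, no punctuation, single spaces.
felts_normalize() {
    # 1. replace each punctuation character with a space (assumption:
    #    a space is preferable to deletion, e.g. "l'homme" -> "l homme")
    # 2. collapse runs of whitespace (including tabs) into one space
    # 3. strip the leading and trailing space left on each line
    # 4. lower-case the line with awk's tolower()
    sed -e 's/[[:punct:]]/ /g' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' \
        -e 's/ $//' |
    awk '{ print tolower($0) }'
}

# Possible use with the client command shown above:
# felts_normalize < text_in.txt | bin/felts_client localhost 11111 | sed '/^$/d' > terms_out.txt
```

If the dictionary terms were not built with the same conventions (case, punctuation handling), occurrences in the text will presumably not match, so it may be worth passing the term list through the same filter before sorting it.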
