FELTS

FELTS is a Fast Extractor for Large Term Sets. It was successfully tested with over 9.5 million distinct multiword terms composed of over 4.5 million distinct words (Wikipedia article titles in French, English and Spanish). In this particular task:

it extracts from any text all occurrences of French, English or Spanish Wikipedia entries
it requires only 500 MB of RAM
it can process ten million words in less than an hour

USAGE:

create a dictionary file containing a sorted list of multiword terms (one term per line, one space between words).
set the DICT variable in the makefile to your dictionary file
build the hash function:

make mph

start a server, e.g.:

bin/felts_server -p 11111 -d sample.dic -f sample.mph

extract terms, e.g.:

cat text_in.txt | sed 's/[[:space:]][[:space:]]*/ /g' | sed 's/^[[:space:]]//' | bin/felts_client localhost 11111 | sed '/^$/d' > terms_out.txt
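
The first step above (preparing the dictionary file) can be sketched in Python as follows. This is only an illustration: build_dictionary and write_dictionary are hypothetical helpers, not part of FELTS, and the sort order is an assumption. This README does not say which collation the tool expects, so the sketch uses plain code-point order (which matches UTF-8 byte order). Terms are assumed to be lower-cased already.

```python
def build_dictionary(terms):
    """Return cleaned, de-duplicated terms in sorted order (hypothetical helper)."""
    # Collapse any run of whitespace to a single space and trim both ends.
    cleaned = {" ".join(term.split()) for term in terms}
    # Drop the empty string produced by blank input lines.
    cleaned.discard("")
    # ASSUMPTION: code-point order is acceptable; sorted() on str compares
    # by Unicode code point, which is the same as UTF-8 byte order.
    return sorted(cleaned)


def write_dictionary(terms, path):
    """Write one term per line, UTF-8 encoded, with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for term in build_dictionary(terms):
            handle.write(term + "\n")
```

The resulting file is what the DICT variable in the makefile would point to. If the hash build or the server rejects the file, the sort order is the first assumption to revisit.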

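WARNING: input text must be UTF-8, lower case and free of punctuation, and words must be separated by a single space. The sed filters in the command above take care of the whitespace only (they collapse runs of whitespace and strip a leading space); lower-casing and punctuation removal have to be done before the text is sent to felts_client.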
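
The requirements in the warning can be met with a small preprocessing step. The sketch below is one possible way to do it in Python; normalize_for_felts is a hypothetical helper, not part of FELTS. It assumes that punctuation and symbols should be replaced by a space rather than deleted (so "l'école" becomes "l école"), which may or may not match how your dictionary terms were written.

```python
import unicodedata


def normalize_for_felts(text: str) -> str:
    """Lower-case, drop punctuation, collapse whitespace (hypothetical helper)."""
    text = text.lower()
    # ASSUMPTION: every character in a Unicode punctuation (P*) or symbol (S*)
    # category is replaced by a space, so that words stay separated.
    chars = [
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    ]
    # str.split() with no argument splits on any whitespace run and ignores
    # leading/trailing whitespace; joining with " " leaves single spaces only.
    return " ".join("".join(chars).split())
```

For example, normalize_for_felts("Hello,  World!") returns "hello world". The output can then be piped to felts_client in place of the sed filters.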
