libfnl is an API and CLI that facilitates data and text mining by providing a collection of easy-to-use tools. The library works with Python 3 only. It is specifically tuned towards mining biomedical/scientific texts, but can be used in other contexts, too. It is a complementary piece to the gnamed gene name repository daemon and the medic PubMed mirroring tool collection. In addition, an (orphan) couchpy repository could provide a document storage facility.
The library contains the following packages:
fnl.nlp
- tools to linguistically analyze text (tokenization, PoS tagging, phrase chunking, entity detection); modules to segment sentences (based on NLTK) and to map text (strings) to entries in dictionaries. This includes a Python wrapper for the GENIA Tagger, a Python wrapper for the NER Suite, and a handler for the GENIA corpus; furthermore, a Maximum Entropy classifier is available via NLTK's wrapper for MegaM
fnl.stat
- a module to evaluate inter-rater Kappa scores and a module to develop text classifiers based on Scikit-Learn
fnl.text
- wrappers to work with text data (strings, tokens, segments, annotations, etc.)
fnl.utils
- additional utilities and tools (currently, just for handling JSON)
scripts
- the CLI scripts to manage data/text, representing the main value provided by this collection
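To illustrate what an inter-rater Kappa score (as evaluated by fnl.stat) measures, here is a minimal standalone sketch of Cohen's kappa for two raters. It is not libfnl's implementation; the function name `cohen_kappa` and the example labels are hypothetical, and the library may well compute other Kappa variants.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equally long sequences of labels.

    Hypothetical helper for illustration; not part of libfnl.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected (chance) agreement, from each rater's own label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same label throughout, so kappa is
        # undefined; this sketch reports perfect agreement in that case.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = ["gene", "gene", "other", "other", "gene", "other"]
    b = ["gene", "other", "other", "other", "gene", "gene"]
    print(round(cohen_kappa(a, b), 3))
```

A kappa of 1 means perfect agreement, while 0 means the raters agree no more often than chance would predict.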
The script directory provides the following command-line interfaces:
fnlclassi
- generate a classifier for [NER-tagged] text using Scikit-Learn
fnlcorpus
- store corpora in JSON format in a CouchDB
fnldgrep
- "grep" for tokens using a dictionary
fnldictag
- tag semantic tokens from a dictionary in linguistically annotated text
fnlgpcounter
- count gene/protein symbols in MEDLINE
fnlkappa
- calculate inter-rater agreement scores
fnlsegment
- segment text into sentences using NLTK (PunktSentenceTokenizer)
fnlsegtrain
- train a nltk.punkt.PunktSentenceTokenizer
fnltok
- a fast, pure-Python, Unicode-aware string tokenizer
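As a rough idea of what a pure-Python, Unicode-aware string tokenizer can look like, the following sketch splits text into runs of letters, runs of digits, and single symbol characters, keeping character offsets. This is an assumption-laden illustration using only the standard `re` module; it is not the algorithm used by fnltok, and the names `TOKEN` and `tokenize` are hypothetical.

```python
import re

# A token is a run of letters, a run of digits, or any single other
# non-whitespace character (punctuation, symbols, underscore).
# For str patterns, Python 3 applies Unicode semantics to \w, \d and \s.
TOKEN = re.compile(r"[^\W\d_]+|\d+|\S")


def tokenize(text):
    """Return a list of (token, start, end) tuples for the given string.

    Hypothetical helper for illustration; not part of libfnl.
    """
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]


if __name__ == "__main__":
    for token, start, end in tokenize("IL-2Rα binds β-catenin."):
        print(start, end, token)
```

Keeping the offsets makes it possible to map tokens back onto the original string, which is useful when tokens are later annotated or matched against a dictionary.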
Warning
This project is under "continuous development"; it is better to take your own snapshot.
- Python 3.2+
- Numpy, SciPy, and Scikit-Learn 0.14+ (for fnlclassi)
- NLTK 3.0+ (for the sentence segmenting tools fnlseg*)
- DAWG (for fnlgpcounter; see Installation below)
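Before running the scripts, it can help to check which of the requirements above are present in the active environment. The sketch below is one way to do that; the import names are assumptions (Scikit-Learn is imported as `sklearn`, and the DAWG package is assumed to be importable as `dawg`), the helper `missing_modules` is hypothetical, and `importlib.util.find_spec` needs Python 3.4 or later.

```python
import importlib.util
import sys

# Import names are assumptions: Scikit-Learn is imported as "sklearn",
# and the DAWG package is assumed to be importable as "dawg".
REQUIRED_MODULES = ("numpy", "scipy", "sklearn", "nltk", "dawg")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found.

    Hypothetical helper for illustration; not part of libfnl.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 2):
        print("Python 3.2+ is required")
    for name in missing_modules():
        print("missing:", name)
```

This only tests whether a module can be located, not whether its version meets the minimum stated above.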
Optional projects that work together with this project:
- GENIA Tagger (optional, latest version)
- NER Suite (optional, latest version, in turn requires CRF Suite)
- MegaM - a MaxEnt classifier for NLTK with a (fast) L-BFGS optimizer
- gnamed for creating gene/protein name repositories
- medic for mirroring and handling PubMed citations
- txtfnnl natural language processing tools based on Apache OpenNLP and UIMA
To install into a Python 3 virtual environment:
    pip install virtualenv # if virtualenv is not yet installed
    git clone git://github.com/fnl/libfnl.git libfnl
    virtualenv libfnl
    cd libfnl
    . bin/activate
    pip install argparse # for python3 < 3.2
    pip install numpy # because installing scipy fails if numpy isn't installed already
    pip install -e . # installs all other dependencies
    # if you prefer to install all other dependencies manually
    # and/or prefer to use setup.py instead of pip:
    # python setup.py install
    pip install sqlalchemy
    pip install scikit-learn
    pip install matplotlib
    pip install nltk --pre # to get 3.0
    # if you want to install the test environment:
    pip install pytest
    # special steps to install DAWG
    git clone [email protected]:fnl/DAWG.git
    cd DAWG
    python setup.py install
    cd ..
All parts of this library are licensed under the GNU Affero GPL v3.
See the attached LICENSE.txt file.
© 2006-2014 Florian Leitner. All rights reserved.