- @senwu: Fix non-deterministic results from get_candidates and get_mentions caused by parallel candidate/mention generation.
- @lukehsiao: Add soft version pinning to avoid failures due to dependency API changes.
- @senwu: Rename the ``span`` attribute to ``context`` in mention_subclass to better support multimodal mentions. (#184)
Note
The way to retrieve the corresponding data model object from a mention has changed.
In Fonduer v0.3.6, we use ``.span``:
# sent_mention is a SentenceMention
sentence = sent_mention.span.sentence
With this release, we use ``.context``:
# sent_mention is a SentenceMention
sentence = sent_mention.context.sentence
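Code that has to run against both releases can wrap the accessor change in a small helper. The following is a minimal sketch, not part of Fonduer: ``mention_context`` is a hypothetical helper name, and the ``SimpleNamespace`` objects are stand-ins for real mention objects, used only so the snippet runs without a database.

```python
from types import SimpleNamespace


def mention_context(mention):
    """Return the data model object behind a mention.

    Hypothetical helper (not part of Fonduer): prefers the newer
    ``.context`` attribute and falls back to the older ``.span``.
    """
    if hasattr(mention, "context"):
        return mention.context
    return mention.span


# Stand-ins for real mentions (assumption: only the attribute names matter here).
old_style = SimpleNamespace(span=SimpleNamespace(sentence="old sentence"))
new_style = SimpleNamespace(context=SimpleNamespace(sentence="new sentence"))

print(mention_context(old_style).sentence)
print(mention_context(new_style).sentence)
```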
Note
Mention extraction now supports all data types in the data model. In Fonduer
v0.3.6, Mention extraction only supported ``MentionNgrams`` and ``MentionFigures``:
from fonduer.candidates import (
    MentionFigures,
    MentionNgrams,
)
With this release, it supports all data types:
from fonduer.candidates import (
    MentionCaptions,
    MentionCells,
    MentionDocuments,
    MentionFigures,
    MentionNgrams,
    MentionParagraphs,
    MentionSections,
    MentionSentences,
    MentionTables,
)
- @senwu: Add support to parse multiple sections in parser, fix webpage context, and add name column for each context in data model. (#182)
- @senwu: Remove unnecessary backref in mention generation.
- @j-rausch: Improve error handling for invalid row spans. (#183)
- @lukehsiao: Updated snorkel-metal version requirement to ensure new syntax works when a user upgrades Fonduer.
- @lukehsiao: Improve error messages on PostgreSQL connection and update FAQ.
Note
The SparseLSTM discriminative model uses less memory than the original LSTM model, at the cost of longer runtime. In Fonduer v0.3.5, SparseLSTM is used as follows:
from fonduer.learning import SparseLSTM
disc_model = SparseLSTM()
disc_model.train(
    (train_cands, train_feature), train_marginals, n_epochs=5, lr=0.001
)
- @senwu: Fix issue with ``get_last_documents`` returning the incorrect number of docs and update the tests. (#176)
- @senwu: Use the latest MeTaL syntax and fix flake8 issues. (#173)
- @senwu: Use ``sqlalchemy`` to check the connection string. Use ``postgresql`` instead of ``postgres`` in the connection string.
- @lukehsiao: The features/labels/gold_label key tables were not properly designed for multiple relations in that they indistinguishably shared the global index of keys. This fixes this issue by including the names of the relations associated with each key. In addition, this ensures that clearing a single relation, or relabeling a single training relation does not inadvertently corrupt the global index of keys. (#167)
- @lukehsiao: Added ``longest_match_only`` parameter to :class:`LambdaFunctionMatcher`, which defaults to False, rather than True. (#165)
- @lukehsiao: Fixes the behavior of the ``get_between_ngrams`` data model util. (#164)
- @lukehsiao: Batch queries so that PostgreSQL buffers aren't exceeded. (#162)
- @lukehsiao: Fix attribute error when using MentionFigures.
- @lukehsiao: :class:`MentionNgrams` ``split_tokens`` now defaults to an empty list and splits on all occurrences, rather than just the first occurrence.
- @j-rausch: Parser will now skip documents with parsing errors rather than crashing.
- @lukehsiao: Fix the layers module in fonduer.learning.disc_models.layers.
- @lukehsiao: Add supporting functions for incremental knowledge base construction. (#154)
- @j-rausch: Added alpha spacy support for Japanese tokenizer.
- @senwu: Add sparse logistic regression support.
- @senwu: Support Python 3.7.
- @lukehsiao: Allow user to change featurization settings by providing ``.fonduer-config.yaml`` in their project.
- @lukehsiao: Add a new Mention object, and have Candidate objects be composed of Mention objects, rather than directly of Spans. This allows a single Mention to be reused in multiple relations.
- @lukehsiao: Improved connection-string validation for the Meta class.
- @j-rausch: ``Document.text`` now returns the modified document text, based on the user-defined html-tag stripping in the parsing stage.
- @j-rausch: ``Ngrams`` now has a ``n_min`` argument to specify a minimum number of tokens per extracted n-gram.
- @lukehsiao: Rename ``BatchLabelAnnotator`` to ``Labeler`` and ``BatchFeatureAnnotator`` to ``Featurizer``. The classes now support multiple relations.
- @j-rausch: Made the spacy tokenizer the default tokenizer, as long as there is (alpha) support for the chosen language. The ``lingual`` argument now specifies whether additional spacy NLP processing shall be performed.
- @senwu: Reorganize the disc model structure. (#126)
- @lukehsiao: Add ``session`` and ``parallelism`` as parameters to all UDF classes.
- @j-rausch: Sentence splitting in lingual mode is now performed by spacy's sentencizer instead of the dependency parser. This can lead to variations in sentence segmentation and tokenization.
- @j-rausch: Added ``language`` argument to ``Parser`` for specification of the language used by ``spacy_parser``. E.g. ``language='en'``.
- @senwu: Change weak supervision learning framework from numbskull to `MeTaL <https://github.com/HazyResearch/metal>`_. (#119)
- @senwu: Change learning framework from Tensorflow to PyTorch. (#115)
- @lukehsiao: Blacklist <script> nodes by default when parsing HTML docs.
- @lukehsiao: Reorganize ReadTheDocs structure to mirror the repository structure. Now, each pipeline phase's user-facing API is clearly shown.
- @lukehsiao: Rather than importing ambiguously from ``fonduer`` directly, disperse imports into their respective pipeline phases. This eliminates circular dependencies and makes it clearer to the user where each import originates.
- @lukehsiao: Provide debug logging of external subprocess calls.
- @lukehsiao: Use ``tqdm`` for progress bar (including multiprocessing).
- @lukehsiao: Set the default PostgreSQL client encoding to "utf8".
- @lukehsiao: Organize documentation for ``data_model_utils`` by modality. (#85)
- @lukehsiao: Rename ``lf_helpers`` to ``data_model_utils``, since they can be applied more generally to throttlers or used for error analysis, and are not limited to just being used in labeling functions.
- @lukehsiao: Update the CHANGELOG to start following KeepAChangelog conventions.
- @lukehsiao: Remove the XMLMultiDocPreprocessor.
- @lukehsiao: Remove the ``reduce`` option for UDFs, which was unused.
- @lukehsiao: Remove get parent/children/sentence generator from Context. (#87)
- @lukehsiao: Remove dependency on ``pdftotree``, which is currently unused.
- @j-rausch: Improve ``spacy_parser`` performance. We split the lingual parsing pipeline into two stages. First, we parse structure and gather all sentences for a document. Then, we merge and feed all sentences per document into the spacy NLP pipeline for more efficient processing.
- @senwu: Speed-up of ``_get_node`` using caching.
- @HiromuHota: Fixed bug with Ngram splitting and empty TemporarySpans. (#108, #112)
- @lukehsiao: Fixed PDF path validation when using ``visual=True`` during parsing.
- @lukehsiao: Fix Meta bug which would not switch databases when init() was called with a new connection string.
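The ``split_tokens`` change listed above (splitting on all occurrences rather than only the first) can be illustrated with plain strings. This is a sketch of the behavioral difference only, not Fonduer's implementation; ``split_first`` and ``split_all`` are hypothetical names, and the part numbers are made-up sample data.

```python
def split_first(text, token):
    """Split only on the first occurrence of ``token`` (the earlier behavior)."""
    return text.split(token, 1)


def split_all(text, token):
    """Split on every occurrence of ``token`` (the behavior described above)."""
    return text.split(token)


part_numbers = "BC546/BC547/BC548"  # made-up sample text
print(split_first(part_numbers, "/"))  # ['BC546', 'BC547/BC548']
print(split_all(part_numbers, "/"))    # ['BC546', 'BC547', 'BC548']
```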
Note
With the addition of Mentions, the process of Candidate extraction has changed. In Fonduer v0.2.3, Candidate extraction was as follows:
candidate_extractor = CandidateExtractor(PartAttr,
                        [part_ngrams, attr_ngrams],
                        [part_matcher, attr_matcher],
                        candidate_filter=candidate_filter)
candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)
With this release, you will now first extract Mentions and then extract Candidates based on those Mentions:
# Mention Extraction
part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
temp_ngrams = MentionNgramsTemp(n_max=2)
volt_ngrams = MentionNgramsVolt(n_max=1)
Part = mention_subclass("Part")
Temp = mention_subclass("Temp")
Volt = mention_subclass("Volt")
mention_extractor = MentionExtractor(
    session,
    [Part, Temp, Volt],
    [part_ngrams, temp_ngrams, volt_ngrams],
    [part_matcher, temp_matcher, volt_matcher],
)
mention_extractor.apply(docs, split=0, parallelism=PARALLEL)
# Candidate Extraction
PartTemp = candidate_subclass("PartTemp", [Part, Temp])
PartVolt = candidate_subclass("PartVolt", [Part, Volt])
candidate_extractor = CandidateExtractor(
    session,
    [PartTemp, PartVolt],
    throttlers=[temp_throttler, volt_throttler]
)
candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)
Furthermore, because Candidates are now composed of Mentions rather than
directly of Spans, to get the Span object from a mention, use the .span
attribute of a Mention.
Note
Fonduer has been reorganized to require more explicit import syntax. In Fonduer v0.2.3, nearly everything was imported directly from fonduer:
from fonduer import (
    CandidateExtractor,
    DictionaryMatch,
    Document,
    FeatureAnnotator,
    GenerativeModel,
    HTMLDocPreprocessor,
    Intersect,
    LabelAnnotator,
    LambdaFunctionMatcher,
    MentionExtractor,
    Meta,
    Parser,
    RegexMatchSpan,
    Sentence,
    SparseLogisticRegression,
    Union,
    candidate_subclass,
    load_gold_labels,
    mention_subclass,
)
With this release, you will now import from each pipeline phase. This makes imports more explicit and allows you to more clearly see which pipeline phase each import is associated with:
from fonduer import Meta
from fonduer.candidates import CandidateExtractor, MentionExtractor
from fonduer.candidates.matchers import (
    DictionaryMatch,
    Intersect,
    LambdaFunctionMatcher,
    RegexMatchSpan,
    Union,
)
from fonduer.candidates.models import candidate_subclass, mention_subclass
from fonduer.features import Featurizer
from metal.label_model import LabelModel # GenerativeModel in v0.2.3
from fonduer.learning import SparseLogisticRegression
from fonduer.parser import Parser
from fonduer.parser.models import Document, Sentence
from fonduer.parser.preprocessors import HTMLDocPreprocessor
from fonduer.supervision import Labeler, get_gold_labels
- @lukehsiao: Support Figures nested in Cell contexts and Paragraphs in Figure contexts. (#84)
Note
Versions 0.2.0 and 0.2.1 had to be skipped due to errors in uploading those versions to PyPI. Consequently, v0.2.2 is the version directly after v0.1.8.
Warning
This release is NOT backwards compatible with v0.1.8. The code has now been refactored into submodules, where each submodule corresponds to a phase of the Fonduer pipeline. Consequently, you may need to adjust the paths of your imports from Fonduer.
- @lukehsiao: Remove the futures imports, truly making Fonduer Python 3 only. Also reorganize the codebase into submodules for each pipeline phase. (#59)
- @lukehsiao: Split models and preprocessors into individual files. (#60, #64)
- @senwu: Add branding, OSX tests. (#61, #62)
- @lukehsiao: Rename Phrase to Sentence. (#72)
- @lukehsiao: Update the Data Model to include Caption, Section, Paragraph. (#76, #77, #78)
- @senwu: Split up lf_helpers into separate files for each modality. (#81)
- A variety of small bugfixes and code cleanup. (view milestone)
- @senwu: Remove the Viewer, which is unused in Fonduer (#55)
- @senwu: Fix SimpleTokenizer for when lingual features are disabled (#53)
- @prabh06: Extend styles parsing and add regex search (#52)
- @lukehsiao: Remove unnecessary encoding in __repr__ (#50)
- @lukehsiao: Fix LocationMatch NER tags for spaCy (#50)
Warning
This release is NOT backwards compatible with v0.1.6. Specifically, the
``snorkel`` submodule in fonduer has been removed. Any previous imports of
the form:
from fonduer.snorkel._ import _
should drop the ``snorkel`` submodule:
from fonduer._ import _
Tip
To leverage the logging output of Fonduer, such as in a Jupyter Notebook, you can configure a logger in your application:
import logging
import sys  # needed for sys.stdout below

logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)
- @lukehsiao: Remove SQLite code, switch to logging, and absorb snorkel codebase directly into the fonduer package for simplicity (#44)
- @lukehsiao: Add lf_helpers to ReadTheDocs (#42)
- @lukehsiao: Remove unused package dependencies (#41)
- @senwu: Fix support for providing a PostgreSQL username and password as part of the connection string provided to Meta.init() (#40)
- @lukehsiao: Switch README from Markdown to reStructuredText
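A connection string that carries credentials has the usual URL shape ``scheme://user:password@host:port/dbname``. As a rough illustration of those components, the sketch below takes one apart using only the Python standard library. It is not Fonduer's own parsing, and the user name, password, and database name are made-up placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical connection string; user, password, and database are placeholders.
conn_string = "postgres://alice:s3cret@localhost:5432/fonduer_demo"

parts = urlsplit(conn_string)
print(parts.scheme)            # postgres
print(parts.username)          # alice
print(parts.password)          # s3cret
print(parts.hostname)          # localhost
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # fonduer_demo
```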
Warning
This release is NOT backwards compatible with v0.1.4. Specifically, in order to initialize a session with PostgreSQL, you no longer do
os.environ['SNORKELDB'] = 'postgres://localhost:5432/' + DBNAME
from fonduer import SnorkelSession
session = SnorkelSession()
which had the side-effects of manipulating your database tables on import
(or creating a ``snorkel.db`` file if you forgot to set the environment
variable). Now, you use the Meta class to initialize your session:
from fonduer import Meta
session = Meta.init("postgres://localhost:5432/" + DBNAME).Session()
No side-effects occur until ``Meta`` is initialized.
- @lukehsiao: Remove reliance on environment vars and remove side-effects of importing fonduer (#36)
- @lukehsiao: Bring codebase in PEP8 compliance and add automatic code-style checks (#37)
- @lukehsiao: Separate tutorials into their own repo (#31)
Minor hotfix to the README formatting for PyPI.
- @lukehsiao: Deploy Fonduer to PyPI using Travis-CI