Release 0.3.0: Lineage Performance and Stable API

@EntilZha EntilZha released this 09 Jun 22:28
· 365 commits to master since this release

The primary goal of this release was to improve the performance of longer data pipelines. It also includes several API additions and a few minor breaking changes.

Performance Improvements

The largest under-the-hood change is that all operations are now lazy by default, whereas 0.2.0 calculated a new list at every transformation. Laziness was initially implemented using generators, but this could lead to unexpected behavior, as highlighted in #20. The code sample below shows the problem:

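from functional import seq

def gen():
    for e in range(5):
        yield e

# Case 1: a sequence built directly from a generator
nums = gen()
s = seq(nums)
s.map(lambda x: x * 2).sum()
# prints 20
s.map(lambda x: x * 2).sum()
# prints 0, because the generator was already consumed

# Case 2: a transformed sequence that is reused
s = seq([1, 2, 3, 4])
a = s.map(lambda x: x * 2)
a.sum()
# prints 20
a.sum()
# prints 0, because the generator behind the map was already consumed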
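The same effect can be reproduced with plain Python generators and no library at all. The snippet below is only an illustration of the single-pass behavior of generators, not the code from #20:

```python
def gen():
    for e in range(5):
        yield e

nums = gen()
first = sum(x * 2 for x in nums)   # consumes the generator: 0 + 2 + 4 + 6 + 8
second = sum(x * 2 for x in nums)  # nothing is left to iterate over
print(first, second)
# prints 20 0
```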

Either ScalaFunctional would need to aggressively cache results, or a new approach was needed. That approach is called lineage. The basic concept is that ScalaFunctional:

  1. Tracks the most recent concrete data (e.g. a list of objects)
  2. Tracks the list of transformations that need to be applied to that data to find the answer
  3. Whenever an expression is evaluated, caches the result as the new concrete data for (1) and returns it
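The three steps above can be sketched in a few lines of plain Python. This is a toy model, not ScalaFunctional's actual implementation: the class `LazySeq` and its attributes `_base` and `_pending` are hypothetical names chosen for illustration.

```python
class LazySeq:
    """Toy model of lineage (hypothetical; not the library's real code)."""

    def __init__(self, base, pending=None):
        # (1) Most recent concrete data. Copying into a list means a
        # generator passed as input is consumed exactly once.
        self._base = list(base)
        # (2) Transformations that still need to be applied to _base.
        self._pending = list(pending or [])

    def map(self, func):
        # Record the step instead of running it.
        return LazySeq(self._base, self._pending + [func])

    def _evaluate(self):
        data = self._base
        for func in self._pending:
            data = [func(x) for x in data]
        # (3) Cache the result as the new concrete data and clear the
        # pending steps so they are not applied again.
        self._base = data
        self._pending = []
        return data

    def sum(self):
        return sum(self._evaluate())


def gen():
    for e in range(5):
        yield e


s = LazySeq(gen())
a = s.map(lambda x: x * 2)
print(a.sum(), a.sum())
# prints 20 20
```

Because evaluation replaces the concrete data and empties the pending list, repeated calls return the same answer without recomputing the transformation.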

As a result, the problems above are fixed. The example below shows how the backend calculates results:

from functional import seq

In [8]: s = seq(1, 2, 3, 4)

In [9]: s._lineage
Out[9]: Lineage: sequence

In [10]: s0 = s.map(lambda x: x * 2)

In [11]: s0._lineage
Out[11]: Lineage: sequence -> map(<lambda>)

In [12]: s0
Out[12]: [2, 4, 6, 8]

In [13]: s0._lineage
Out[13]: Lineage: sequence -> map(<lambda>) -> cache

Note that initially, because the expression has not been evaluated, nothing is cached. Printing s0 in the REPL calls __repr__, which evaluates the expression and caches the result, so it is not recomputed if s0 is used again. You can also call cache() directly if desired. You may also notice that seq can now take a list of arguments, as in seq(1, 2, 3, 4) (added in #27).

Next up

Next up are improvements to the documentation and a redo of README.md. The next release will focus on extending ScalaFunctional to work with other data input/output and on further usability improvements. This release also marks relative stability in the collections API. Everything that seemed worth porting from Scala/Spark has been completed, with a few additions (predominantly left, right, inner, and outer joins). There aren't currently any foreseeable breaking changes.