Release 0.3.0: Lineage Performance and Stable API
The primary goal of this release was to improve the performance of longer data pipelines. It also includes several API additions and a few minor breaking changes.
Performance Improvements
The largest under-the-hood change is that all operations are now lazy by default. 0.2.0 calculates a new list at every transformation. Laziness was initially implemented using generators, but this could lead to unexpected behavior. The problem with that approach is highlighted in #20 and shown in the code sample below:
from functional import seq
def gen():
    for e in range(5):
        yield e
nums = gen()
s = seq(nums)
s.map(lambda x: x * 2).sum()
# prints 20
s.map(lambda x: x * 2).sum()
# prints 0
s = seq([1, 2, 3, 4])
a = s.map(lambda x: x * 2)
a.sum()
# prints 20
a.sum()
# prints 0
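The same pitfall can be reproduced in plain Python without the library. The snippet below is only an illustration of single-use iterators, not ScalaFunctional code; the variable names are chosen for this example.

```python
def gen():
    for e in range(5):
        yield e

# A generator can be consumed only once.
nums = gen()
first = sum(x * 2 for x in nums)   # consumes every element of nums
second = sum(x * 2 for x in nums)  # nums is already exhausted
print(first, second)  # 20 0

# In Python 3, the built-in map also returns a single-use iterator.
doubled = map(lambda x: x * 2, [1, 2, 3, 4])
first_map = sum(doubled)   # consumes the iterator
second_map = sum(doubled)  # nothing left to sum
print(first_map, second_map)  # 20 0
```

Any design that hands a raw generator from one transformation to the next inherits this behavior, which is why the second call in each pair above returns 0.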
Either ScalaFunctional would need to aggressively cache results, or a new approach was needed. That approach is called lineage. The basic concept is that ScalaFunctional:
- Tracks the most recent concrete data (e.g., a list of objects)
- Tracks the list of transformations that need to be applied to that data to find the answer
- Whenever an expression is evaluated, caches the result as the new concrete data (the first item above) and returns it
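The three points above can be sketched as a toy class. This is a minimal illustration under stated assumptions, not ScalaFunctional's actual implementation: the class name LazySeq, its attributes, and the helper double are hypothetical and exist only for this example.

```python
class LazySeq:
    """Toy lineage tracker (hypothetical; not the library's real class)."""

    def __init__(self, base, transformations=None):
        # Most recent concrete data.
        self._base = list(base)
        # Pending transformations, applied in order on evaluation.
        self._transformations = list(transformations or [])

    def map(self, func):
        # Record the transformation; no work is done yet.
        step = lambda data: [func(x) for x in data]
        return LazySeq(self._base, self._transformations + [step])

    def _evaluate(self):
        # Apply pending transformations once, then cache the result
        # as the new concrete data and clear the pending list.
        if self._transformations:
            data = self._base
            for step in self._transformations:
                data = step(data)
            self._base = data
            self._transformations = []
        return self._base

    def sum(self):
        return sum(self._evaluate())

    def __repr__(self):
        # Printing forces evaluation, so the result is cached.
        return repr(self._evaluate())


calls = []

def double(x):
    calls.append(x)  # record each invocation to observe caching
    return x * 2

a = LazySeq([1, 2, 3, 4]).map(double)
pending_before = len(calls)  # 0: nothing has been evaluated yet
first = a.sum()
second = a.sum()             # served from the cached list
print(pending_before, first, second, len(calls))  # 0 20 20 4
```

Because the cached list replaces the pending transformations, the second call to sum() returns 20 without calling double again, unlike the generator-based behavior described earlier.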
The result is that the problems above are fixed. Below is an example showing how the backend calculates results:
from functional import seq
In [8]: s = seq(1, 2, 3, 4)
In [9]: s._lineage
Out[9]: Lineage: sequence
In [10]: s0 = s.map(lambda x: x * 2)
In [11]: s0._lineage
Out[11]: Lineage: sequence -> map(<lambda>)
In [12]: s0
Out[12]: [2, 4, 6, 8]
In [13]: s0._lineage
Out[13]: Lineage: sequence -> map(<lambda>) -> cache
Note how initially, since the expression has not been evaluated, it is not cached. Since printing s0 in the REPL calls __repr__, it is evaluated and cached, so it is not recomputed if s0 is used again. You can also call cache() directly if desired. You may also notice that seq can now take a list of arguments directly, as in seq(1, 2, 3, 4) (added in #27).
Next up
Improvements in documentation and a redo of README.md. The next release will focus on extending ScalaFunctional further to work with other data input/output and on more usability improvements. This release also marks relative stability in the collections API. Everything that seemed worth porting from Scala/Spark has been completed, with a few additions (predominantly left, right, inner, and outer joins). There are currently no foreseeable breaking changes.