Postprocessing command "pd_merge"
The command uses the pandas merge function to merge the input dataframe with the result of a subsearch query.
- right - subsearch, the subsearch whose result is merged with the input dataframe
- how - keyword argument, text, not required, default value is inner, type of merge to be performed: left, right, inner, outer, cross
- on - keyword argument, string, required, comma-separated list of fields
... | pd_merge [subsearch query], how='inner', on='field1,field2,field3'
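The mapping from these arguments to pandas can be sketched as follows. This is a minimal illustration, not the command's actual source: the helper name pd_merge_sketch and the sample frames are hypothetical, and it assumes the on string is simply split on commas before being passed to DataFrame.merge.

```python
import pandas as pd


def pd_merge_sketch(left: pd.DataFrame, right: pd.DataFrame,
                    on: str, how: str = "inner") -> pd.DataFrame:
    """Hypothetical stand-in for pd_merge (assumption, not the real code)."""
    # Turn 'field1,field2' into ['field1', 'field2'], ignoring stray spaces.
    fields = [name.strip() for name in on.split(",") if name.strip()]
    return left.merge(right, how=how, on=fields)


# Made-up sample data: only the first row matches on both key fields.
left = pd.DataFrame({"field1": [1, 2], "field2": ["x", "y"], "v": [10, 20]})
right = pd.DataFrame({"field1": [1, 2], "field2": ["x", "z"], "w": [100, 200]})

result = pd_merge_sketch(left, right, on="field1,field2")
print(result)
```

With several fields in on, a row is kept by an inner merge only when all listed fields match.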
query: readFile a.csv
a b
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
query: readFile b.csv
a c
0 1 100
1 2 200
2 3 300
3 6 600
query: readFile a.csv | pd_merge [readFile b.csv] on="a", how=inner
1 out of 2 command. Command pd_merge. Start pd_merge command
a b c
0 1 10 100
1 2 20 200
2 3 30 300
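For comparison, the same inner merge can be reproduced in plain pandas. This sketch assumes a.csv and b.csv contain exactly the rows shown above and builds the frames in memory instead of reading files.

```python
import pandas as pd

# Frames mirroring the a.csv and b.csv listings above.
a = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
b = pd.DataFrame({"a": [1, 2, 3, 6], "c": [100, 200, 300, 600]})

# Inner merge keeps only the keys present in both frames.
inner = a.merge(b, how="inner", on="a")
print(inner)
```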
query: readFile a.csv | pd_merge [readFile b.csv] on="a", how=right
a b c
0 1 10.0 100
1 2 20.0 200
2 3 30.0 300
3 6 NaN 600
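Column b is printed as 10.0, 20.0, 30.0 here because pandas stores the missing value as NaN, which forces the integer column to float64. The sketch below reproduces this in plain pandas and shows one way to get integers back with the nullable Int64 dtype. Whether pd_merge exposes such a conversion is not stated in this document, so treat the last step as a pandas-only assumption.

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
b = pd.DataFrame({"a": [1, 2, 3, 6], "c": [100, 200, 300, 600]})

# Right merge keeps every key of b; key 6 has no match in a, so b is NaN.
right_join = a.merge(b, how="right", on="a")
print(right_join.dtypes)

# Optional: convert the float column to nullable integers after the merge.
b_as_int = right_join["b"].astype("Int64")
print(b_as_int)
```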
query: readFile a.csv | pd_merge [readFile b.csv] on="a", how=left
1 out of 2 command. Command pd_merge. Start pd_merge command
a b c
0 1 10 100.0
1 2 20 200.0
2 3 30 300.0
3 4 40 NaN
4 5 50 NaN
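The how argument also accepts outer, which is not shown above. As a rough illustration in plain pandas (not output captured from pd_merge), an outer merge of the same two frames keeps the keys of both sides:

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
b = pd.DataFrame({"a": [1, 2, 3, 6], "c": [100, 200, 300, 600]})

# Outer merge keeps the union of keys; unmatched cells become NaN.
outer = a.merge(b, how="outer", on="a")
print(outer)
```

Keys 4 and 5 have no value for c, and key 6 has no value for b, so both b and c become float columns.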
- Create virtual environment with post-processing sdk
make dev
That command
- downloads Miniconda
- creates python virtual environment with postprocessing_sdk
- creates link to current command in postprocessing pp_cmd directory
- Configure otl_v1 command. Example:
vi ./venv/lib/python3.9/site-packages/postprocessing_sdk/pp_cmd/otl_v1/config.ini
Config example:
[spark]
base_address = http://localhost
username = admin
password = 12345678
[caching]
# 24 hours in seconds
login_cache_ttl = 86400
# Command syntax defaults
default_request_cache_ttl = 100
default_job_timeout = 100
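Before running the command, it can help to check that the edited file still parses as INI. The sketch below uses Python's standard configparser on the example text above. How otl_v1 itself loads config.ini is not described here, so this is only a syntax check under that assumption.

```python
import configparser

# Same content as the config example above, inlined for a quick check.
CONFIG_TEXT = """
[spark]
base_address = http://localhost
username = admin
password = 12345678

[caching]
# 24 hours in seconds
login_cache_ttl = 86400
# Command syntax defaults
default_request_cache_ttl = 100
default_job_timeout = 100
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Numeric options can be read as integers.
login_ttl = parser.getint("caching", "login_cache_ttl")
print(parser.sections(), login_ttl)
```

To check the real file instead, parser.read(path) can be called with the config.ini path used above.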
- Configure storages for readFile and writeFile commands:
vi ./venv/lib/python3.9/site-packages/postprocessing_sdk/pp_cmd/readFile/config.ini
Config example:
[storages]
lookups = /opt/otp/lookups
pp_shared = /opt/otp/shared_storage/persistent
Use pp to run pd_merge command:
pp
Storage directory is /tmp/pp_cmd_test/storage
Commmands directory is /tmp/pp_cmd_test/pp_cmd
query: | otl_v1 <# makeresults count=100 #> | pd_merge
Unpack archive pp_cmd_pd_merge to postprocessing commands directory
Use make test to run all tests in a Docker container. Turn the VPN on first so that all the OTL dependencies can be downloaded.