Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orquesta workflows have incorrect status #4740

Closed
bishopbm1 opened this issue Jul 12, 2019 · 12 comments · Fixed by StackStorm/orquesta#172
Closed

Orquesta workflows have incorrect status #4740

bishopbm1 opened this issue Jul 12, 2019 · 12 comments · Fixed by StackStorm/orquesta#172

Comments

@bishopbm1
Copy link
Contributor

SUMMARY

When using orquesta workflows if a task fails but still publishes content the workflow is marked as succeeded. This happens if there is no when clause and when there is a when failed clause.

STACKSTORM VERSION

# st2 --version
st2 3.1.0, on Python 2.7.5
OS, environment, install method
# cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"

Install method is Puppet forge module

Steps to reproduce the problem

Using the examples.orquesta-basic workflow. We can need to modify the workflow to remove the when: <% succeeded() %> or modify it to be when: <% succeeded() or failed() %>

Workflow:

version: 1.0

description: A basic workflow that runs an arbitrary linux command.

input:
  - cmd
  - timeout

tasks:
  task1:
    action: core.local cmd=<% ctx(cmd) %> timeout=<% ctx(timeout) %>
    next:
      - when: <% succeeded() or failed() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

output:
  - stdout: <% ctx(stdout) %>

or:

version: 1.0

description: A basic workflow that runs an arbitrary linux command.

input:
  - cmd
  - timeout

tasks:
  task1:
    action: core.local cmd=<% ctx(cmd) %> timeout=<% ctx(timeout) %>
    next:
      - publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

output:
  - stdout: <% ctx(stdout) %>

Action file was unaltered:

---
name: orquesta-basic
pack: examples
description: Run a local linux command
runner_type: orquesta
entry_point: workflows/orquesta-basic.yaml
enabled: true
parameters:
  cmd:
    required: true
    type: string
  timeout:
    type: integer
    default: 60

can be forced to fail by running invalid command:

st2 run examples.orquesta-basic cmd="secho 'hi'"

Expected Results

I would expect the results of the workflow to be failed when the task fails unless manually overridden.

Actual Results

The workflow says succeeded even though the single task failed. Looking at the logs it looks like a second subtask is formed for the publish method that succeeds and send that to the workflow.

# st2 run examples.orquesta-basic cmd="secho 'hi'"
..
id: 5d289d9e49b3d2af833a29a9
action.ref: examples.orquesta-basic
parameters: 
  cmd: secho 'hi'
status: succeeded
start_timestamp: Fri, 12 Jul 2019 14:47:58 UTC
end_timestamp: Fri, 12 Jul 2019 14:48:02 UTC
result: 
  output:
    stdout: ''
+--------------------------+---------------------+-------+------------+-------------------------------+
| id                       | status              | task  | action     | start_timestamp               |
+--------------------------+---------------------+-------+------------+-------------------------------+
| 5d289da049b3d2af5f26fa4f | failed (1s elapsed) | task1 | core.local | Fri, 12 Jul 2019 14:48:00 UTC |
+--------------------------+---------------------+-------+------------+-------------------------------+

Added debugging logs to:
https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L624
and produced the following logs showing the second subtask.

2019-07-12 09:38:43,757 140001394854000 DEBUG conducting [-] Conductor Event Status: [failed]
2019-07-12 09:38:43,757 140001394854000 DEBUG conducting [-] Conductor staged task: None
2019-07-12 09:38:43,758 140001394854000 DEBUG conducting [-] Conductor task state entry: {u'status': u'running', u'route': 0, u'next': {}, u'ctxs': {u'in': [0]}, u'prev': {}, u'id': u'task1'}
2019-07-12 09:38:43,758 140001394854000 DEBUG conducting [-] Conductor task state entry: {u'status': u'running', u'route': 0, u'next': {}, u'ctxs': {u'in': [0]}, u'prev': {}, u'id': u'task1'}
2019-07-12 09:38:43,759 140001394854000 DEBUG conducting [-] Conductor task state idx: 0
2019-07-12 09:38:43,759 140001394854000 DEBUG conducting [-] Conductor task state entry: {u'status': u'running', u'route': 0, u'next': {}, u'ctxs': {u'in': [0]}, u'prev': {}, u'id': u'task1'}
2019-07-12 09:38:43,760 140001394854000 DEBUG conducting [-] Conductor task state idx: 0
2019-07-12 09:38:43,761 140001394854000 DEBUG conducting [-] Conductor staged task: None
2019-07-12 09:38:43,761 140001394854000 DEBUG conducting [-] Conductor old task status: running
2019-07-12 09:38:43,762 140001394854000 DEBUG conducting [-] Conductor new task status: failed
2019-07-12 09:38:43,770 140001394854000 DEBUG conducting [-] Conductor Event Status: [succeeded]
2019-07-12 09:38:43,771 140001394854000 DEBUG conducting [-] Conductor staged task: {'ready': True, 'route': 0, 'ctxs': {'in': [0, 1]}, 'prev': {u'task1__t0': 0}, 'id': u'noop'}
2019-07-12 09:38:43,771 140001394854000 DEBUG conducting [-] Conductor task state entry: None
2019-07-12 09:38:43,772 140001394854000 DEBUG conducting [-] Conductor task state entry: {'ctxs': {'in': [0, 1]}, 'route': 0, 'prev': {u'task1__t0': 0}, 'id': u'noop', 'next': {}}
2019-07-12 09:38:43,772 140001394854000 DEBUG conducting [-] Conductor task state idx: 1
2019-07-12 09:38:43,773 140001394854000 DEBUG conducting [-] Conductor task state entry: {'ctxs': {'in': [0, 1]}, 'route': 0, 'prev': {u'task1__t0': 0}, 'id': u'noop', 'next': {}}
2019-07-12 09:38:43,774 140001394854000 DEBUG conducting [-] Conductor task state idx: 1
2019-07-12 09:38:43,774 140001394854000 DEBUG conducting [-] Conductor staged task: {'ready': True, 'route': 0, 'ctxs': {'in': [0, 1]}, 'prev': {u'task1__t0': 0}, 'id': u'noop'}
2019-07-12 09:38:43,775 140001394854000 DEBUG conducting [-] Conductor old task status: null
2019-07-12 09:38:43,775 140001394854000 DEBUG conducting [-] Conductor new task status: succeeded
2019-07-12 09:38:43,793 140001394854000 DEBUG workflows [-] [5d288d6149b3d2af833a2991] Conductor: {'state': {'status': 'succeeded', 'tasks': {u'task1__r0': 0, u'noop__r0': 1}, 'sequence': [{u'status': 'failed', 'term': True, u'route': 0, u'next': {u'noop__t0': True}, u'ctxs': {u'in': [0], 'out': {u'noop__t0': 1}}, u'prev': {}, u'id': u'task1'}, {'status': 'succeeded', 'term': True, 'route': 0, 'next': {}, 'ctxs': {'in': [0, 1]}, 'prev': {u'task1__t0': 0}, 'id': u'noop'}], 'contexts': [{u'parent': {u'user': u'test', u'pack': u'examples'}, u'cmd': u"secho 'hi'", u'st2': {u'workflow_execution_id': u'5d288d6249b3d2ae94b89c73', u'pack': u'examples', u'action_execution_id': u'5d288d6149b3d2af833a2991', u'api_url': u'http://127.0.0.1:9101/v1', u'user': u'test'}, u'timeout': 60}, {u'stdout': {u'failed': True, u'stderr': u'bash: secho: command not found', u'return_code': 127, u'succeeded': False, u'stdout': u''}}], 'staged': [], 'routes': [[]]}, 'errors': [{'message': 'Execution failed. See result for details.', 'type': 'error', 'result': {u'succeeded': False, u'failed': True, u'return_code': 127, u'stderr': u'bash: secho: command not found', u'stdout': u''}, 'task_id': u'task1'}], 'log': [], 'context': {u'parent': {u'user': u'test', u'pack': u'examples'}, u'st2': {u'workflow_execution_id': u'5d288d6249b3d2ae94b89c73', u'user': u'test', u'action_execution_id': u'5d288d6149b3d2af833a2991', u'api_url': u'http://127.0.0.1:9101/v1', u'pack': u'examples'}}, 'input': {u'cmd': u"secho 'hi'", u'timeout': 60}, 'graph': {'directed': True, 'graph': [], 'nodes': [{'id': u'task1'}, {'id': u'noop'}], 'adjacency': [[{u'ref': 0, u'id': u'noop', u'key': 0, u'criteria': []}], []], 'multigraph': True}, 'spec': {'catalog': 'native', 'version': '1.0', 'spec': {u'input': [u'cmd', u'timeout'], u'tasks': {u'task1': {u'action': u'core.local cmd=<% ctx(cmd) %> timeout=<% ctx(timeout) %>', u'next': [{u'publish': [{u'stdout': u'<% result() %>'}]}]}}, u'description': u'A basic workflow that runs an arbitrary linux command.', u'version': 1.0, u'output': [{u'stdout': u'<% ctx(stdout) %>'}]}}, 'output': {u'stdout': {u'failed': True, u'stderr': u'bash: secho: command not found', u'return_code': 127, u'succeeded': False, u'stdout': u''}}}
2019-07-12 09:38:43,795 140001394854000 INFO workflows [-] [5d288d6149b3d2af833a2991] Updating workflow execution from status "running" to "succeeded".
2019-07-12 09:38:43,857 140001394854000 DEBUG workflows [-] [5d288d6149b3d2af833a2991] Updating workflow liveaction from status "running" to "succeeded".
@nmaludy
Copy link
Member

nmaludy commented Jul 12, 2019

@m4dcoder this is causing us issues because we're expecting the workflow to fail since the task failed.

However, when we publish a variable to return data to the calling workflow, this automatically marks the workflow as succeeding.

@namachieli
Copy link

I'm seeing similar, but opposite. I have a workflow that should Succeed, but fails when using publish: and do: noop with no when:

Tasks fail, and cause the workflow to fail, even though every task has do: noop

@m4dcoder
Copy link
Contributor

I'm seeing similar, but opposite. I have a workflow that should Succeed, but fails when using publish: and do: noop with no when:

@namachieli Please open a separate issue.

@m4dcoder
Copy link
Contributor

@nmaludy This behavior is expected. When a task failed and there is a task transition (an entry under next), then the task is considered remediated. When a task is remediated, the workflow will continue. If you want the workflow to fail, you have to explicitly tell the workflow to fail with the engine command.

@nmaludy
Copy link
Member

nmaludy commented Jul 16, 2019

@m4dcoder well that's confusing and inconsistent with the way Mistral work (that's fine).. would be nice if this was called out somewhere

It will also cause us to duplicate our next blocks, one for success, one for failure just like Mistral

@nmaludy
Copy link
Member

nmaludy commented Jul 16, 2019

I also think (if I remember correctly) this only happens when we do a publish, if there is no publish then it fails as expected

@m4dcoder
Copy link
Contributor

m4dcoder commented Jul 16, 2019

well that's confusing and inconsistent with the way Mistral work

I really don't care about how Mistral works at this point.

@m4dcoder
Copy link
Contributor

We can update our documentation so this is clearer.

@nmaludy
Copy link
Member

nmaludy commented Jul 16, 2019

so, basically in any task that contains any transition at all we also need another transition that is:

- when: failed()
   do: fail

then copy that over/over?

@m4dcoder
Copy link
Contributor

I also think (if I remember correctly) this only happens when we do a publish, if there is no publish then it fails as expected

That's true. The rule here is that if there's an entry under next with any combination of the fields, then it's considered remediated. So even if there's just a publish, it's considered remediated.

It will also cause us to duplicate our next blocks, one for success, one for failure just like Mistral

Yes, you will have to explicitly do that. We are also working on a proposal at StackStorm/orquesta#116. I believe this will help where you can publish some vars regardless of task completion status and then nest other whens to say do this and that.

then copy that over/over

You have to provide a more concrete use case. You can conceptually do this..

  next:
    - publish: blahblahblah
      do: fail

@m4dcoder
Copy link
Contributor

@nmaludy Per conversation on https://stackstorm-community.slack.com/team/U4H00PLGP, we'll have to explore this a bit more. Regarding your proposal for this use case, when you have a task that fails and no transitions are executed, that the workflow itself should fail. By no transitions are executed, 1) no task transition matched condition or 2) when condition of one or more task transitions are met but there are no tasks to execute next.

@m4dcoder
Copy link
Contributor

From initial assessment, changes need to happen in the state machine where the task is evaluated if remediation has occurred at https://github.com/StackStorm/orquesta/blob/master/orquesta/machines.py#L657. In this case, 1) if no task transition matched condition or 2) when condition of one or more task transitions are met but there are no tasks to execute next, then the task is not considered remediated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants