
[BUG] Security privilege evaluation for wildcard index pattern can stall network threads(HTTP/Transport) #5022

Open
Bukhtawar opened this issue Jan 13, 2025 · 7 comments
Labels
bug Something isn't working triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

Comments

@Bukhtawar

Bukhtawar commented Jan 13, 2025

What is the bug?

This was observed during an OpenSearch Dashboards call from the Discover page trying to list the * index pattern, which triggers a */_field_caps API call that performs fine-grained privilege evaluation. When the index count is large and/or multiple roles are configured, privilege evaluation can slow down significantly.

Now, if the HTTP network threads, aka event-loop threads (meant for async I/O, i.e. reads/writes on the socket channel), perform CPU- or I/O-intensive work, the other requests bound to the same socket can stall, since wildcard privilege evaluation is a function of the number of indices and roles. This can manifest as request timeouts, delays, and in the worst case, failing external health checks.

100.1% (5s out of 5s) cpu usage by thread 'opensearch[02df94be6e6ebfc89cd831c9875c570e][http_server_worker][T#4]'
     8/10 snapshots sharing following 71 elements
       org.opensearch.security.securityconf.ConfigModelV7$IndexPattern.attemptResolveIndexNames(ConfigModelV7.java:765)
       org.opensearch.security.securityconf.ConfigModelV7.lambda$impliesTypePerm$4(ConfigModelV7.java:1016)
       org.opensearch.security.securityconf.ConfigModelV7$$Lambda$6496/0x000000c801b02228.apply(Unknown Source)
       java.base@…/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
       java.base@…/java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1707)
       java.base@…/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
       java.base@…/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
       java.base@…/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:575)
       java.base@…/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
       java.base@…/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:616)
       org.opensearch.security.securityconf.ConfigModelV7.impliesTypePerm(ConfigModelV7.java:1017)
       org.opensearch.security.securityconf.ConfigModelV7$SecurityRoles.impliesTypePermGlobal(ConfigModelV7.java:491)
       org.opensearch.security.privileges.PrivilegesEvaluator.evaluate(PrivilegesEvaluator.java:489)
       org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:303)
       org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:149)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:217)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:189)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:108)
       app//org.opensearch.action.fieldcaps.TransportFieldCapabilitiesAction.doExecute(TransportFieldCapabilitiesAction.java:139)
       app//org.opensearch.action.fieldcaps.TransportFieldCapabilitiesAction.doExecute(TransportFieldCapabilitiesAction.java:66)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:219)
       org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:87)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:217)
       org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:217)
       org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:317)
       org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:149)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:217)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:189)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:108)
       app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110)
       app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97)
       app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:472)
       app//org.opensearch.client.support.AbstractClient.fieldCaps(AbstractClient.java:728)
       app//org.opensearch.rest.action.RestFieldCapabilitiesAction.lambda$prepareRequest$1(RestFieldCapabilitiesAction.java:88)
       app//org.opensearch.rest.action.RestFieldCapabilitiesAction$$Lambda$7735/0x000000c801eebb90.accept(Unknown Source)

Even a _bulk request takes roughly a few hundred milliseconds to a few seconds, which results in elevated latencies. Similarly, if this evaluation happens on transport threads, they can also stall.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a cluster with 5000 indices and multiple backend roles
  2. Perform a */_field_caps API call using one of those user roles
  3. Capture hot_threads or thread dumps
  4. Notice privilege evaluation taking more than 60s

What is the expected behavior?

  1. Privilege evaluation should not block network threads
  2. Privilege evaluation can be optimised or parallelised to speed up requests

Proposal

  1. Fork privilege evaluation to a separate thread pool and call back once complete (see the sketch after this list)
  2. Optimise/parallelise privilege evaluation
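
A minimal sketch of what proposal 1 could look like, using plain java.util.concurrent types as stand-ins for the node's thread pool and the security plugin's evaluator (all names here are hypothetical, not the plugin's actual API):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;
    import java.util.function.Supplier;

    // Hypothetical sketch: fork privilege evaluation to a dedicated pool and call back on completion.
    final class AsyncPrivilegeEvaluationSketch {

        // Stand-in for the result of PrivilegesEvaluator.evaluate(...)
        record EvaluationResult(boolean allowed) {}

        // Dedicated pool, sized independently of the HTTP/transport event-loop threads.
        private final ExecutorService privilegeEvalPool = Executors.newFixedThreadPool(4);

        // Submit the (potentially expensive) evaluation to the dedicated pool and invoke the
        // callback when it completes; the callback is what would normally continue the request
        // (e.g. proceed with the filter chain or send an error response).
        void evaluateAsync(Supplier<EvaluationResult> expensiveEvaluation,
                           Consumer<EvaluationResult> onComplete,
                           Consumer<Exception> onFailure) {
            privilegeEvalPool.execute(() -> {
                try {
                    onComplete.accept(expensiveEvaluation.get()); // runs off the network thread
                } catch (Exception e) {
                    onFailure.accept(e);                          // never leave the caller hanging
                }
            });
        }
    }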

What is your host/environment?

  • OS:
  • Version: OpenSearch 2.7 or above
  • Plugins:


@Bukhtawar Bukhtawar added bug Something isn't working untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jan 13, 2025
@kumargu

kumargu commented Jan 13, 2025

cc @nibix (for visibility)

@nibix
Collaborator

nibix commented Jan 13, 2025

cc @nibix (for visibility)

It's likely that this is fixed by #4380.

@Bukhtawar how many indices and roles do you have on the affected cluster?

@Bukhtawar
Author

Thanks @nibix

I see two problems

  1. Guard rails to protect network threads from getting stalled due to CPU intensive privilege evaluation
  2. Performance optimisations for privilege evaluation

For 1, I feel that for the '*' wildcard pattern we could have users with 10k indices and 100s of roles; currently this takes more than 300s without the optimisation. If, with the optimisation, this time reduces to less than a couple of seconds, I am good. However, if the time remains proportional to the count of indices, which in future could grow to tens of thousands, and privilege evaluation takes tens of seconds, then I strongly feel this task needs to be offloaded off the network threads.
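
To illustrate why the evaluation time is proportional to these counts, here is a purely illustrative sketch of full pattern resolution (not the actual ConfigModelV7 code):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: shows why the cost grows with (roles x patterns x indices).
    final class WildcardResolutionCostSketch {

        static Set<String> resolveIndices(List<List<String>> patternsPerRole, List<String> allIndices) {
            Set<String> resolved = new HashSet<>();
            for (List<String> rolePatterns : patternsPerRole) {   // ~100s of roles
                for (String pattern : rolePatterns) {             // patterns per role
                    for (String index : allIndices) {             // ~10k indices
                        if (globMatches(pattern, index)) {
                            resolved.add(index);
                        }
                    }
                }
            }
            return resolved;                                      // O(roles * patterns * indices)
        }

        static boolean globMatches(String pattern, String candidate) {
            // Simplified: '*' matches everything; otherwise treat '*' as a wildcard.
            return "*".equals(pattern)
                || candidate.matches(pattern.replace(".", "\\.").replace("*", ".*"));
        }
    }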

It would be good to see some benchmarks for this.

@nibix
Collaborator

nibix commented Jan 13, 2025

@Bukhtawar

If with the optimisation this time reduces to less than a couple of seconds, I am good.

The optimization yields constant time for all cases except index expressions with patterns other than the full wildcard *.
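
Roughly speaking, the idea is to precompute, at configuration-update time, which roles grant a given action on the full wildcard, so the per-request check no longer iterates over indices. A minimal illustrative sketch of that general approach (not the actual implementation in #4380):

    import java.util.Map;
    import java.util.Set;

    // Rough sketch of the precomputation idea: build, once per config update, a map
    // action -> roles that grant it on '*'; a request for a well-known action then
    // reduces to a set lookup, independent of the number of indices in the cluster.
    final class PrecomputedActionPrivilegesSketch {

        private final Map<String, Set<String>> rolesGrantingActionOnWildcard;

        PrecomputedActionPrivilegesSketch(Map<String, Set<String>> precomputed) {
            this.rolesGrantingActionOnWildcard = precomputed;
        }

        boolean impliesOnFullWildcard(String action, Set<String> userRoles) {
            Set<String> granting = rolesGrantingActionOnWildcard.getOrDefault(action, Set.of());
            // O(|userRoles|) membership checks; no iteration over index names.
            return userRoles.stream().anyMatch(granting::contains);
        }
    }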

See this for benchmarks and before-after comparisons: https://eliatra.com/blog/performance-improvements-for-the-access-control-layer-of-opensearch/

Do you mean with "network threads" the threads processing the transport requests? I do not think that offloading privilege evaluation off the transport threads will be possible without major conceptual changes. At the moment, the access control concept is tightly coupled to the execution of transport requests.

@Bukhtawar
Author

I have yet to look at the benchmarks; however, if worst-case privilege evaluation runs over 5-10 seconds, we don't have a choice but to offload it, else all critical requests would either slow down or stall.

Do you mean with "network threads" the threads processing the transport requests? I do not think that offloading privilege evaluation off the transport threads will be possible without major conceptual changes. At the moment, the access control concept is tightly coupled to the execution of transport requests.

In this context I am referring to the HTTP worker threads that handle the REST requests. I do feel that forking in the transport action class, in this case TransportFieldCapabilitiesAction#doExecute, is something that can be done before the loop, something like below, if we are looking to tackle this at the per-API level:

    threadPool.executor(executorName).execute(new ActionRunnable<FieldCapabilitiesResponse>(listener) {
        @Override
        protected void doRun() {
            // runs on the chosen executor instead of the network thread
            doInternalExecute(task, request, executorName, listener);
        }
    });

But more generically, this needs to sit in TransportAction around these lines. Alternatively, we can introduce a transport filter to apply this to selected actions.
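
A rough sketch of what such a transport filter could look like, assuming OpenSearch's ActionFilter/ActionFilterChain plugin interfaces (the action allow-list and wiring here are hypothetical, and the exact import packages vary across versions):

    import java.util.Set;

    import org.opensearch.action.ActionRequest;
    import org.opensearch.action.support.ActionFilter;
    import org.opensearch.action.support.ActionFilterChain;
    import org.opensearch.core.action.ActionListener;   // org.opensearch.action.ActionListener on older 2.x
    import org.opensearch.core.action.ActionResponse;   // org.opensearch.action.ActionResponse on older 2.x
    import org.opensearch.tasks.Task;
    import org.opensearch.threadpool.ThreadPool;

    // Sketch of the "transport filter" alternative: for an allow-listed set of actions,
    // continue the filter chain (and hence privilege evaluation) on a non-network pool.
    public class OffloadingActionFilter implements ActionFilter {

        // Hypothetical allow-list; in practice this would be configurable.
        private static final Set<String> OFFLOADED_ACTIONS = Set.of("indices:data/read/field_caps");

        private final ThreadPool threadPool;

        public OffloadingActionFilter(ThreadPool threadPool) {
            this.threadPool = threadPool;
        }

        @Override
        public int order() {
            // Run before the other filters so that they execute off the network thread.
            return Integer.MIN_VALUE;
        }

        @Override
        public <Request extends ActionRequest, Response extends ActionResponse> void apply(
            Task task, String action, Request request,
            ActionListener<Response> listener, ActionFilterChain<Request, Response> chain) {

            if (OFFLOADED_ACTIONS.contains(action)) {
                threadPool.executor(ThreadPool.Names.GENERIC)
                    .execute(() -> chain.proceed(task, action, request, listener));
            } else {
                chain.proceed(task, action, request, listener);
            }
        }
    }

The trade-off is an extra thread hand-off per request for the selected actions, which should be negligible compared to multi-second privilege evaluation.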

@nibix
Collaborator

nibix commented Jan 13, 2025

I am yet to look at the benchmarks however if the worst case privilege evaluation runs over 5-10s of second we don't have a choice but to offload else all critical requests would either slow down or stall.

I'd expect the optimized privilege evaluation code to run within milliseconds for all relevant cases.

@cwperks cwperks added triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jan 13, 2025
@cwperks
Member

cwperks commented Jan 13, 2025

[Triage] I noticed that ISM is also applying an actionFilter:

org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:87)

When is this filter from ISM applied exactly?

Edit: This filter is not adding considerable time to the request. Based on the line in ISM, it exits early and continues with the rest of the chain of action filters.
