Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected poor performance handling CDATA section #40

Open
tsouza opened this issue Oct 31, 2016 · 1 comment
Open

Unexpected poor performance handling CDATA section #40

tsouza opened this issue Oct 31, 2016 · 1 comment

Comments

@tsouza
Copy link

tsouza commented Oct 31, 2016

I've ran a xml filter benchmark against 2 types of xml log entries:

  1. One that wraps a nested xml text inside a CDATA section.
  2. One that escapes a nested xml text.

In both cases, this nested xml text is ~18k chars.

I've ran the benchmark in a dedicated Core i5 (4x cores) with 16GB RAM.

I've used Logstash v2.4.0.

JVM is:

openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102-8u102-b14.1-2-b14)
OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)

Linux:

Linux homeserver 4.8.0-26-generic #28-Ubuntu SMP Tue Oct 18 14:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

The logstash configuration is as following:
(from https://www.elastic.co/guide/en/logstash/2.4/plugins-filters-metrics.html)

input {
  generator {
    lines => ['<sample_xml_log>']
  }
}

filter {
  xml {
    source => message
    target => entry
  }
  metrics {
    meter => "events"
    add_tag => "metric"
  }

}

output {
  # only emit events with the 'metric' tag
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  }
}

The output for the log entry that uses CDATA section is:

rate: 6.032965896569965
rate: 6.232386633063851
rate: 6.245788257933181
rate: 6.276051178133002
rate: 6.295079587467865
rate: 6.101150213348986
rate: 6.070909269120961
rate: 5.995443066741132
rate: 5.983002341013909
rate: 6.01280267908362
rate: 6.02451769874094
rate: 6.011403109806091
rate: 6.006735744554418
rate: 6.028854466363812
rate: 6.044576382685371
rate: 6.034575149976681
rate: 6.01545334110413
rate: 6.018715406521946
rate: 6.001227888163898
rate: 6.033517878763309
rate: 6.01778815696872
rate: 6.016365894465621
rate: 5.774201799793936
rate: 5.851427472899567
rate: 5.863306676273883
rate: 5.960459293207369
rate: 5.969307896129195
rate: 5.97401969497492
rate: 6.007417476593245
rate: 6.037758970727505
rate: 6.018748813045858
rate: 6.062558991100919
rate: 6.073548167421381
rate: 6.051676463568129
rate: 6.039506676632154
rate: 6.033441679780281

The output for the log entry that does NOT uses CDATA section is:

rate: 47.8
rate: 50.27862314649097
rate: 58.27311268621327
rate: 69.11691806009138
rate: 74.19274503773141
rate: 79.10259805424599
rate: 87.34742691392835
rate: 89.3185378259186
rate: 92.84309693804262
rate: 94.80655849725238
rate: 100.21004309947467
rate: 100.21004309947467
rate: 103.13561452214401
rate: 109.14742413686035
rate: 111.83813553922879
rate: 114.31370953635731
rate: 116.92716102397382
rate: 118.96385677535498
rate: 120.90167179400903
rate: 122.92441443461854
rate: 124.48159627893531
rate: 125.96224608858356
rate: 127.27653632474998
rate: 129.9340955963991
rate: 129.9340955963991
rate: 130.78689422649668
rate: 132.60297900678245
rate: 133.78608348203736
rate: 134.41084975126867
rate: 135.11359140431682
rate: 135.9680294590949
rate: 136.37036360926098
rate: 137.03312405112845
rate: 137.75007586656673
rate: 137.9779432189174
rate: 138.25155577202057
rate: 138.8710871659807
rate: 139.07328787197613
rate: 139.3223786224067

I believe it's rather odd that the use of CDATA actually makes things worse. In theory the parser should just ignore and fast-forward until it finds a closing ]]>

@jordansissel
Copy link
Contributor

The performance difference there looks very stark (6 events per second vs 140 events per second). I agree this is very strange!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants