Memory efficient cheerio.load #1343

Open
JonathanMontane opened this issue Aug 22, 2019 · 2 comments
@JonathanMontane

Hi,

I am a software engineer at Algolia and we love your library. However, we've run into a pickle with certain documents we are trying to process: they fail to load because of the memory consumption of cheerio.load.

After some analysis, it seems that cheerio.load transforms a 5MB file into a 150MB - 500MB in-memory representation. That's a 30x to 100x increase in size.

It would be awesome for us to have a more memory-efficient parser. I have looked into how the htmlparser2 library is used and it seems to me that it could be possible to have a more efficient representation of the elements, but I am not 100% sure how.
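
As a rough illustration of what a lighter-weight path could look like, here is a minimal sketch that consumes htmlparser2's streaming `Parser` events directly instead of building a DOM. It is not a replacement for cheerio.load, since no tree is kept and selectors are not available. The helper name `summarizeHtml` and the shape of its return value are assumptions made for this example; the optional second argument exists only so the function can be exercised with a stand-in parser.

```javascript
// Sketch only: aggregates parser events instead of materialising a DOM tree.
// `summarizeHtml` is a hypothetical helper, not part of cheerio or htmlparser2.
function summarizeHtml(html, Parser = require('htmlparser2').Parser) {
  const summary = { elements: 0, textLength: 0 };

  const parser = new Parser(
    {
      // Called once per opening tag; only a counter is kept.
      onopentag() {
        summary.elements += 1;
      },
      // Called for each chunk of text; only its length is kept.
      ontext(text) {
        summary.textLength += text.length;
      },
    },
    { decodeEntities: true }
  );

  parser.write(html);
  parser.end();
  return summary;
}
```

Because nothing but two counters is retained, memory use should stay close to the size of the input string, at the cost of losing random access to the document.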

Could this type of constraint be something you consider for a future release?
Thank you!

Code snippet used for measurements:

// Run with `node --expose-gc` so that global.gc() is available.
const cheerio = require('cheerio');

// Builds a flat document with `siblings` identical <div> children.
const generateWideFile = (siblings) => {
  const elts = `<div>Some Element</div>`.repeat(siblings);
  return `<html><body>${elts}</body></html>`;
};

const testMemoryPressure = () => {
  // const sizes = [100, 200, 300, 400, 500];
  // const sizes = [1000, 2000, 3000, 4000, 5000];
  // const sizes = [10000, 20000, 30000, 40000, 50000];
  const sizes = [100000, 200000, 300000, 400000, 500000];
  global.gc();
  const base = process.memoryUsage().heapUsed;

  const results = sizes.map((size) => {
    global.gc();
    const html = generateWideFile(size);
    const $ = cheerio.load(html, {
      //_useHtmlParser2: true,
      decodeEntities: true,
      normalizeWhitespace: false,
      xmlMode: false,
    });
    global.gc();
    const usedSize = process.memoryUsage().heapUsed;
    // Use $ after measuring so the parsed document is still reachable at measurement time.
    $.html();
    // Heap growth since `base` (the `html` string is still alive here too).
    const memory = usedSize - base;
    return { memory, size: html.length, ratio: Math.round(memory / html.length) };
  });

  console.log(results);
};

testMemoryPressure();
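
The measurement pattern in the snippet above (collect garbage, record `heapUsed`, build the object, collect again, record the difference) can be factored into a small dependency-free helper. This is a sketch under assumptions: `measureRetainedHeap` is a hypothetical name, and it falls back to a no-op when Node is started without `--expose-gc`, in which case the reported number is much noisier.

```javascript
// Hypothetical helper: estimates the heap retained by whatever `build` returns.
function measureRetainedHeap(build) {
  const gcExposed = typeof global.gc === 'function';
  // Without --expose-gc there is no way to force a collection; use a no-op.
  const collect = gcExposed ? global.gc : () => {};

  collect();
  const before = process.memoryUsage().heapUsed;

  // Hold on to the result so it is still alive when the heap is sampled again.
  const value = build();

  collect();
  const after = process.memoryUsage().heapUsed;

  return { value, bytes: after - before, gcExposed };
}
```

It could be used as `measureRetainedHeap(() => cheerio.load(html))`, run under `node --expose-gc`, to get a per-document figure that does not include the source string when `html` is created beforehand.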

cheerio version: 1.0.0-rc.3

@5saviahv
Contributor

This issue has been untouched for so long ... maybe it is related to #263 and that V8 bug in general?

@myfreeer

Maybe related to #1960
