Memory efficient cheerio.load #1343

JonathanMontane · 2019-08-22T14:42:09Z

Hi,

I am a software engineer at Algolia, and we love your library. However, we have run into a problem with certain documents we are trying to process: they fail to load because of the memory consumption of cheerio.load.

After some analysis, it seems that cheerio.load turns a 5MB file into a 150MB to 500MB in-memory representation. That is a 30x to 100x increase in size.

It would be awesome for us to have a more memory-efficient parser. I have looked into how the htmlparser2 library is used, and it seems to me that a more efficient representation of the elements could be possible, but I am not 100% sure how.

Could this type of constraint be something you consider for a future release?
Thank you!

Code snippet used for measurements:

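// Run with: node --expose-gc <script>, otherwise global.gc is undefined.
const cheerio = require('cheerio');

// Builds a flat document: one <body> with `siblings` identical <div> children.
const generateWideFile = (siblings) => {
  const elts = `<div>Some Element</div>`.repeat(siblings);
  return `<html><body>${elts}</body></html>`;
};

const testMemoryPressure = () => {
  // const sizes = [100, 200, 300, 400, 500];
  // const sizes = [1000, 2000, 3000, 4000, 5000];
  // const sizes = [10000, 20000, 30000, 40000, 50000];
  const sizes = [100000, 200000, 300000, 400000, 500000];
  global.gc();
  // Baseline heap usage before any document is generated or parsed.
  const base = process.memoryUsage().heapUsed;

  const memory = sizes.map((size) => {
    global.gc();
    const html = generateWideFile(size);
    const $ = cheerio.load(html, {
      //_useHtmlParser2: true,
      decodeEntities: true,
      normalizeWhitespace: false,
      xmlMode: false,
    });
    global.gc();
    // Note: `html` is still referenced here, so the source string is included.
    const usedSize = process.memoryUsage().heapUsed;
    // Use `$` after the measurement so it is still alive when heapUsed is read.
    $.html();
    const memory = usedSize - base;
    // `size` is the document length in characters; `ratio` is heap bytes per character.
    return { memory, size: html.length, ratio: Math.round(memory / html.length) };
  });

  console.log(memory);
};

testMemoryPressure();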

cheerio version: 1.0.0-rc.3
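
For anyone reproducing these numbers, here is a minimal sketch of the same measurement pulled out into a reusable helper, so the heap delta can be compared across different loaders. `measureRetained` and `build` are hypothetical names for illustration, not part of cheerio. The sketch assumes Node is started with `--expose-gc`; without that flag it skips collection and the readings will include uncollected garbage.

```javascript
// Hypothetical helper (not part of cheerio): reports how much heap is retained
// by whatever `build(input)` returns, relative to the length of `input`.
const measureRetained = (build, input) => {
  // global.gc only exists when Node is started with --expose-gc.
  const collect = typeof global.gc === 'function' ? global.gc : () => {};

  collect();
  const before = process.memoryUsage().heapUsed;
  const result = build(input); // held in a variable so it survives the next gc
  collect();
  const after = process.memoryUsage().heapUsed;

  const bytes = after - before;
  return {
    result, // returned so the caller keeps the structure alive
    bytes, // heap growth in bytes (can be noisy for small inputs)
    inputLength: input.length, // length in characters, as in the snippet above
    ratio: bytes / input.length,
  };
};

// Usage with a stand-in builder; to measure cheerio, pass
// (html) => cheerio.load(html) instead.
const html = '<div>Some Element</div>'.repeat(1000);
const report = measureRetained((s) => s.split('<div>'), html);
console.log(report.bytes, report.inputLength, report.ratio);
```

Unlike the snippet above, the baseline here is taken after the input string already exists, so the ratio reflects only the parsed structure and not the source text.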

5saviahv · 2021-07-12T02:39:04Z

This issue has gone untouched for so long ... maybe it is related to #263 and that V8 bug in general?

myfreeer · 2022-06-27T13:05:58Z

Maybe related to #1960

fb55 added the 🙋Question label Dec 22, 2020