[Proposal] Add a (meta) DeclarativeRipperLogic class, constructed from json description of how to find content on a page with a few properties #2071

metaprime · 2025-01-06T20:27:11Z

Some trivial rippers are just a matter of finding one or more CSS selectors that match all of the elements on the page that you want to rip, and maybe specifying a naming scheme, a rate limit, etc.

We should look at some simple examples to see what they have in common, then design a JSON format and an algorithm that make this more declarative. There's a good chance we can remove a lot of the boilerplate involved in adding and maintaining rippers. It might also make it easier for people to contribute their own.
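To make the idea concrete, here is a rough sketch of what such a description and its loader could look like. It is written in Python only for brevity. Every field name (`name`, `url_pattern`, `selectors`, `filename_template`, `rate_limit_ms`) and the `example.com` site are assumptions made up for illustration, not an agreed format, and the sketch stops short of fetching or selecting anything.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical description format: all keys below are illustrative assumptions.
EXAMPLE_DESCRIPTION = """
{
  "name": "example-gallery",
  "url_pattern": "^https?://gallery[.]example[.]com/album/(?P<album>[a-z0-9]+)$",
  "selectors": ["div.gallery img.full", "a.download"],
  "filename_template": "{album}_{index:03d}.{ext}",
  "rate_limit_ms": 500
}
"""


@dataclass(frozen=True)
class DeclarativeRipperLogic:
    """Immutable, data-only description of how to rip one kind of page."""
    name: str
    url_pattern: re.Pattern
    selectors: tuple
    filename_template: str
    rate_limit_ms: int

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        selectors = tuple(raw["selectors"])
        if not selectors:
            raise ValueError("at least one CSS selector is required")
        return cls(
            name=raw["name"],
            url_pattern=re.compile(raw["url_pattern"]),
            selectors=selectors,
            # Optional fields fall back to assumed defaults.
            filename_template=raw.get("filename_template", "{index:03d}.{ext}"),
            rate_limit_ms=int(raw.get("rate_limit_ms", 0)),
        )

    def can_rip(self, url):
        """True if this description applies to the given URL."""
        return self.url_pattern.match(url) is not None

    def filename_for(self, url, index, ext):
        """Build a file name from named groups in the URL plus an item index."""
        groups = self.url_pattern.match(url).groupdict()
        return self.filename_template.format(index=index, ext=ext, **groups)


logic = DeclarativeRipperLogic.from_json(EXAMPLE_DESCRIPTION)
print(logic.can_rip("https://gallery.example.com/album/abc123"))
print(logic.filename_for("https://gallery.example.com/album/abc123", 7, "jpg"))
```

Keeping the description as plain data means a new site could be supported by adding a JSON file rather than a class, which is where the boilerplate savings would come from.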

I think this could be a huge productivity win for keeping up with (gestures broadly) the ever-changing internet.

metaprime · 2025-01-07T04:26:26Z

It occurs to me that the current approach of finding all the classes and trying to construct them makes the program easily extensible, but we'd need to add at least one other mode of creating and loading rippers.

The JSON descriptions would have to be instantiated as DeclarativeRipperLogic instances at runtime and selected by a different kind of query than "try to construct each one", which might actually be far more efficient, so that's probably not a problem. This might mean that such a ripper can only be initialized for one type of URL at a time. We'll have to think about how that scales up for queues with lots of items to rip. (Are these heavy instances? Do they all live in a queue, or do the URLs live in a queue and get dispatched at runtime?)
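One possible answer to the queue question, sketched in Python for brevity: keep only URLs in the queue and look up a shared, lightweight description when each URL is dequeued. The names `Description`, `DeclarativeRegistry`, and `dispatch` are hypothetical and do not exist in the app today. This only illustrates the lookup-by-URL-pattern idea and says nothing about how it would fit alongside the existing class-based rippers.

```python
import re
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """Stand-in for a loaded JSON description (hypothetical shape)."""
    name: str
    url_pattern: re.Pattern


class DeclarativeRegistry:
    """Holds one shared description per site and selects by URL match."""

    def __init__(self, descriptions):
        self._descriptions = list(descriptions)

    def find(self, url):
        # Linear scan over compiled patterns; no ripper is constructed here.
        for description in self._descriptions:
            if description.url_pattern.match(url):
                return description
        return None


def dispatch(urls, registry):
    """Drain a queue of URLs, pairing each with the description that handles it."""
    pending = deque(urls)
    assignments = []
    while pending:
        url = pending.popleft()
        description = registry.find(url)
        # None means no declarative description matched this URL.
        assignments.append((url, description.name if description else None))
    return assignments


registry = DeclarativeRegistry([
    Description("site-a", re.compile(r"^https?://a[.]example[.]com/")),
    Description("site-b", re.compile(r"^https?://b[.]example[.]com/")),
])

for url, name in dispatch(
    ["https://a.example.com/x", "https://b.example.com/y", "https://c.example.com/z"],
    registry,
):
    print(url, "->", name)
```

Under this assumption the descriptions stay small and immutable, so the queue size would be driven by the number of URLs rather than the number of ripper instances.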

An interesting design problem, in part because of how the app currently works.
