
[Proposal] Add a (meta) DeclarativeRipperLogic class, constructed from json description of how to find content on a page with a few properties #2071

Open
metaprime opened this issue Jan 6, 2025 · 1 comment

Comments

@metaprime
Contributor

Some trivial rippers are just a matter of finding one or more CSS selectors that match all of the elements on the page that you want to rip, plus maybe specifying a naming scheme, a rate limit, etc.

We should look at some simple examples to see what they have in common, then design a JSON format and an algorithm that make this more declarative. There's a good chance we can remove a lot of the boilerplate involved in adding and maintaining rippers. It might also make it even easier for people to contribute their own.
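To make the idea concrete, here is a rough sketch of what a description and the matching algorithm might look like. It is written in Python with only the standard library for brevity, and everything in it is an assumption for illustration: the field names (`name`, `url_pattern`, `selectors`, `filename_template`, `rate_limit_ms`), the helper names, and the example URLs. It also only understands a tiny selector subset (`tag`, `.class`, `tag.class`) rather than real CSS; an actual implementation would presumably use a proper selector engine.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical description format; every field name here is an assumption.
SPEC_JSON = r"""
{
  "name": "example-gallery",
  "url_pattern": "^https?://gallery\\.example\\.com/album/(?P<album>[\\w-]+)/?$",
  "selectors": [{"css": "a.full-image", "attr": "href"}],
  "filename_template": "{album}_{index:03d}",
  "rate_limit_ms": 500
}
"""

PAGE = """
<div class="album">
  <a class="full-image" href="https://cdn.example.com/a/1.jpg"><img class="thumb" src="t1.jpg"></a>
  <a class="nav" href="/next">next</a>
  <a class="full-image big" href="https://cdn.example.com/a/2.jpg">two</a>
</div>
"""


def parse_simple_selector(css):
    """Split 'tag', '.class', or 'tag.class' into (tag or None, class or None)."""
    tag, _, cls = css.partition(".")
    return (tag or None, cls or None)


class _Collector(HTMLParser):
    """Collects one attribute value from every start tag that matches a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules  # list of (tag or None, class or None, attr)
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        for want_tag, want_class, attr in self.rules:
            if want_tag and want_tag != tag:
                continue
            if want_class and want_class not in classes:
                continue
            value = attr_map.get(attr)
            if value:
                self.found.append(value)


def find_content(spec, url, html):
    """Return (filename, link) pairs for a page, or [] if the URL is not ours."""
    match = re.match(spec["url_pattern"], url)
    if match is None:
        return []
    rules = [parse_simple_selector(s["css"]) + (s["attr"],) for s in spec["selectors"]]
    collector = _Collector(rules)
    collector.feed(html)
    names = match.groupdict()  # named groups feed the filename template
    return [
        (spec["filename_template"].format(index=i, **names), link)
        for i, link in enumerate(collector.found, start=1)
    ]


if __name__ == "__main__":
    spec = json.loads(SPEC_JSON)
    for filename, link in find_content(spec, "https://gallery.example.com/album/cats", PAGE):
        print(filename, link)
```

The sketch does no network I/O, so `rate_limit_ms` is carried in the description but not used; a real loader would hand it to whatever performs the downloads.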

I think this could be a huge productivity win for keeping up with *gestures broadly* the ever-changing internet.

@metaprime metaprime changed the title [Proposal] Add a DeclarativeRipper class, constructed from json description of how to find content on a page with a few properties [Proposal] Add a (meta) DeclarativeRipperLogic class, constructed from json description of how to find content on a page with a few properties Jan 7, 2025
@metaprime
Contributor Author

metaprime commented Jan 7, 2025

It occurs to me that the current approach of finding all the classes and trying to construct them makes the program easily extensible, but we'd need to add at least one other mode of creating and loading rippers.

The JSON descriptions would have to be loaded as instances of DeclarativeRipperLogic at runtime and selected by running a different kind of query than "try to construct each one", which might actually be way more efficient. Probably not a problem. This might mean that such a ripper can only be initialized for one type of URL at a time. We'll have to think about how that scales up for queues with lots of items to rip. (Are these heavy instances? Do they all live in a queue, or do the URLs live in a queue and get dispatched at runtime?)
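One possible answer to the queue question, sketched under assumptions: keep the URLs in the queue and keep one small, immutable logic object per JSON description, then pick the logic for each URL at dispatch time by matching a URL pattern. The `DeclarativeRipperLogic` below is only a hypothetical stand-in for the proposed class, and `can_rip`, `load_registry`, `dispatch`, and the JSON fields are made-up names. It is in Python for brevity and is not tied to how the app loads rippers today.

```python
import json
import re
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical descriptions; field names and URLs are assumptions.
DESCRIPTIONS_JSON = r"""
[
  {"name": "example-gallery",
   "url_pattern": "^https?://gallery\\.example\\.com/album/[\\w-]+/?$"},
  {"name": "example-comics",
   "url_pattern": "^https?://comics\\.example\\.org/read/\\d+$"}
]
"""


@dataclass(frozen=True)
class DeclarativeRipperLogic:
    """Stand-in for the proposed class: immutable, so one instance can serve many URLs."""
    name: str
    url_pattern: re.Pattern

    def can_rip(self, url: str) -> bool:
        return self.url_pattern.match(url) is not None


def load_registry(descriptions_json: str) -> List[DeclarativeRipperLogic]:
    """Build one logic object per JSON description, compiling each pattern once."""
    return [
        DeclarativeRipperLogic(d["name"], re.compile(d["url_pattern"]))
        for d in json.loads(descriptions_json)
    ]


def dispatch(queue: deque, registry: List[DeclarativeRipperLogic]) -> List[Tuple[str, Optional[str]]]:
    """Drain the URL queue, pairing each URL with the first matching logic's name."""
    assignments = []
    while queue:
        url = queue.popleft()
        logic = next((r for r in registry if r.can_rip(url)), None)
        assignments.append((url, logic.name if logic else None))
    return assignments


if __name__ == "__main__":
    registry = load_registry(DESCRIPTIONS_JSON)
    urls = deque([
        "https://comics.example.org/read/42",
        "https://gallery.example.com/album/cats",
        "https://unknown.example.net/",
    ])
    for url, name in dispatch(urls, registry):
        print(url, "->", name)
```

With this shape the instances stay light (a name and a compiled pattern), and a URL that no description claims comes back with `None`, which is where the existing class-based rippers could be tried as a fallback.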

An interesting design problem, in part because of how the app currently works.
