
feat(webscraper): add implementation from masa-oracle #12

Merged · 5 commits merged into main on Nov 25, 2024

Conversation

@mudler (Contributor) commented Nov 22, 2024

This PR adds the webscraper code from masa-oracle, adapted to run as a separate job.

WIP: I want to refactor the Golang client around a job result so it is easier to consume as an API, basically addressing the comment in the code.


codecov bot commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 36.30137% with 93 lines in your changes missing coverage. Please review.

Project coverage is 57.35%. Comparing base (3d1156d) to head (205458e).
Report is 6 commits behind head on main.

Files with missing lines        Patch %   Lines
internal/jobs/webscraper.go     51.61%    38 Missing and 7 partials ⚠️
pkg/client/result.go             0.00%    26 Missing ⚠️
pkg/client/http.go              15.38%    22 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #12      +/-   ##
==========================================
+ Coverage   56.20%   57.35%   +1.14%     
==========================================
  Files           9        6       -3     
  Lines         290      340      +50     
==========================================
+ Hits          163      195      +32     
- Misses        107      124      +17     
- Partials       20       21       +1     


@mudler mudler marked this pull request as ready for review November 23, 2024 10:28
@mudler mudler requested a review from a team November 25, 2024 15:36
@mcamou left a comment

Approving this, considering that the "WIP" status mentioned will be addressed in a separate PR. Otherwise, ignore the approval :)

-	Error string      `json:"error"`
-	Data  interface{} `json:"data"`
+	Error string      `json:"error"`
+	Data  []byte      `json:"data"`

Why this change? We're unmarshalling twice (once to a []byte, and from there to the actual data)

@mudler (Contributor, Author)

The rationale behind this change is that using interface{} forces the caller to guess the underlying type: while this is not an urgent issue, it made the code quite ugly and eventually added more complexity. I found that the switch increases readability and avoids an extra step.

When using interfaces, the result was a marshal of an interface, which had two consequences:

  1. The receiver would have to unmarshal it to a known type, so whoever populates the result could inadvertently nest interfaces and make it hard to unmarshal.
  2. If the result was, say, a string (as in most cases), the API would always return something like "data" including the quotes, because interface{} was marshaled to a string type.

type Section struct {
	Title      string   `json:"title"`      // Title is the heading text of the section.
	Paragraphs []string `json:"paragraphs"` // Paragraphs contains all the text content of the section.
	Images     []string `json:"images"`     // Images storing base64 - maybe!!?

We will probably need more than just the base64 contents, at the very least a MIME type

@mudler (Contributor, Author)

oh yes, I think there is much room for improvement here. At this stage I'm not touching any of the scraper code. I've actually just copy-pasted this part from the existing implementation: https://github.com/masa-finance/masa-oracle/blob/main/pkg/scrapers/web/web.go

// }
// logrus.WithField("result", string(res)).Info("Scraping completed")
// }()
func scrapeWeb(uri []string, depth int) ([]byte, error) {

Very small nit: shouldn't depth be a uint?

@mudler (Contributor, Author)

Yep, makes total sense 👍 Just note I've copied the existing code as-is; it wasn't my intention to change it at all.
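The nit can be sketched as follows (a hypothetical rewrite with a stub body; the PR keeps the copied code unchanged):

```go
package main

import "fmt"

// scrapeWeb stub with depth as a uint, per the review nit: a uint
// rules out negative depths at the type level, so no runtime guard
// like `if depth < 0` is needed. (Hypothetical signature; the PR's
// copied code keeps int.)
func scrapeWeb(uris []string, depth uint) ([]byte, error) {
	return []byte(fmt.Sprintf("scraped %d URIs to depth %d", len(uris), depth)), nil
}

func main() {
	out, err := scrapeWeb([]string{"https://example.com"}, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // scraped 1 URIs to depth 2
}
```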

@mudler mudler merged commit 8af6478 into main Nov 25, 2024
3 of 4 checks passed