-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(webscraper): add implementation from masa-oracle #12
Conversation
Signed-off-by: mudler <[email protected]>
Signed-off-by: mudler <[email protected]>
Signed-off-by: mudler <[email protected]>
Signed-off-by: mudler <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #12 +/- ##
==========================================
+ Coverage 56.20% 57.35% +1.14%
==========================================
Files 9 6 -3
Lines 290 340 +50
==========================================
+ Hits 163 195 +32
- Misses 107 124 +17
- Partials 20 21 +1 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this considering that the "WIP" status mentioned will be addressed in a separate PR. Otherwise ignore the approval :)
Error string `json:"error"` | ||
Data interface{} `json:"data"` | ||
Error string `json:"error"` | ||
Data []byte `json:"data"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? We're unmarshalling twice (once to a []byte
, and from there to the actual data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rationale behind this change it's that when using interface{}
it forces the caller to guess the underlying type: while this is not an urgent issue it made look the code quite ugly to see, and eventually added more complexity. I found out the switch to increase readibility and avoided an extra step.
when using interfaces the result was a marshal of an interface, that had two consequences:
- The receiver would have to unmarshal it to known kind, so who populate the result could inadvertitely nest interfaces and make it un-easy to unmarshal
- If the result let's say was a string (like most of the cases) the API would always return something like
"data"
including the"
, becauseinterface{}
was marshaled to a string type
type Section struct { | ||
Title string `json:"title"` // Title is the heading text of the section. | ||
Paragraphs []string `json:"paragraphs"` // Paragraphs contains all the text content of the section. | ||
Images []string `json:"images"` // Images storing base64 - maybe!!? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will probably need more than just the base64 contents, at the very least a MIME type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yes, I think there is much room for improvement here. At this stage I'm not touching any of the scraper code. I've actually just copy-pasted this part from the existing implementation: https://github.com/masa-finance/masa-oracle/blob/main/pkg/scrapers/web/web.go
// } | ||
// logrus.WithField("result", string(res)).Info("Scraping completed") | ||
// }() | ||
func scrapeWeb(uri []string, depth int) ([]byte, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very small nit: shouldn't depth
be a uint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep makes totally sense 👍 just note I've copied the existing code as-is and wasn't my intention to change it at all
Signed-off-by: mudler <[email protected]>
This PR adds the webscraper code from masa-oracle, adapted to run as a separate job.
WIP - want to refactor out the golang client around a job result so it's easier to consume as API, basically addressing the comment in the code.