Scrape multiple terms #3
Conversation
Force-pushed from 7210e12 to 972ccb4
Yay! Though it would be good to move this at least one step towards where we're trying to get all the scrapers (with separate data-gathering / data-storing steps), rather than pushing the old approach an extra step in the wrong direction.
I think there's also going to be a slightly subtle problem here, in that this change essentially switches the scraper from fetching current members to fetching all members. But as it doesn't pay attention to start and end dates, it's going to end up with incorrect data (i.e. including people as currently active who aren't).
It's still an improvement over what we currently have, so if you plan to handle that in a separate PR, I'm OK with that. But if it's something you just hadn't noticed, it might be worth trying to add fields for those dates too.
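To make the concern concrete: if the scraper stores every member ever, but captures start and end dates, "currently active" can be derived rather than assumed. A minimal sketch, using illustrative field names rather than the scraper's actual schema:

```ruby
# Illustrative rows only; the real scraper's fields may differ.
members = [
  { name: 'Member A', start_date: '2014-05-21', end_date: nil },
  { name: 'Member B', start_date: '2009-05-06', end_date: '2014-05-06' },
]

# Without end dates, both would look "current"; with them, we can filter.
current = members.select { |m| m[:end_date].nil? }
current.map { |m| m[:name] }  # => ["Member A"]
```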
```diff
   scrape_list(page.next_page) unless page.next_page.to_s.empty?
 end

-def scrape_person(url)
+def scrape_person(url, term:)
   data = scrape(url => MemberPage).to_h
```
It's a little bit clumsy to need to pass the 'term' around here, when the only reason it's needed is for combining with the data that this method is gathering. The method is already doing two things, so this is largely an inherited problem, but I think this is the time to rectify that, rather than pushing it even further in the wrong direction. So I think it would be much cleaner to take this opportunity to turn this into a `person_data` method which simply returns the scraped data for the person, and hoist the `save_sqlite` out of this method.
(The ideal would be to split the `save_sqlite` out entirely from the data-gathering, so there's just a single delete and re-save after all the scraping, thus ensuring the data doesn't end up in a half-baked state if something fails half-way through, as per current conventions, but that's not quite as necessary here.)
The change in the test here is also a little bit out of place, as it's not really connected to this change, but is really just reflecting the site's switch to https everywhere. I've factored that out as #6
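A rough sketch of the shape being suggested: `person_data` purely gathers, and saving happens once after all the scraping. The helpers below are stand-ins, not the scraper's real code (which uses Scraped page classes and `ScraperWiki.save_sqlite`):

```ruby
# Stand-in for the page-scraping step; the real version would be
# something like: scrape(url => MemberPage).to_h
def person_data(url)
  { id: url.split('/').last, source: url }
end

# Gather everything first, merging the term in at this level, so
# person_data itself never needs to know about terms or storage.
def scrape_term(term, urls)
  urls.map { |url| person_data(url).merge(term: term) }
end

all_rows = scrape_term(26, ['http://example.org/person/1'])
# A single delete and re-save after all scraping means a failure
# part-way through never leaves the table half-baked:
# ScraperWiki.sqliteexecute('DELETE FROM data')
# ScraperWiki.save_sqlite(%i[id term], all_rows)
```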
Rather than scraping all people in the `national-assembly` organization, this switches to the /position/ route, which allows you to specify a term to scrape. This also means we can avoid hard-coding the term in MemberPage, since we can now parse it from the URL.
Now that we can do term-specific scraping, we can add term 25 to the list of terms we're interested in.
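For the "parse it from the URL" part, something along these lines would work; the URL pattern shown is an assumption for illustration, not the site's actual route shape:

```ruby
# Hypothetical /position/ URL carrying the term number at the end of
# the slug; adjust the regex to the site's real route shape.
def term_from_url(url)
  url[%r{/position/[^/]*?(\d+)}, 1]
end

term_from_url('http://example.org/position/member-national-assembly-26/')
# => "26"
```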
Force-pushed from 972ccb4 to ec7ab4a
Closed in favour of #15.
Update the scraper to use the new per-term views added in mysociety/pombola#2287. This allows us to scrape terms 25 and 26, rather than having 26 hard-coded in the scraper. Part of everypolitician/everypolitician-data#42096