Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scrape multiple terms #3

Closed
wants to merge 2 commits into from
Closed

Scrape multiple terms #3

wants to merge 2 commits into from

Conversation

chrismytton
Copy link
Collaborator

@chrismytton chrismytton commented Jul 10, 2017

Update the scraper to use the new per-term views added in mysociety/pombola#2287. This allows us to scrape term 25 and 26, rather than having 26 hard-coded in the scraper.

Part of everypolitician/everypolitician-data#42096

Copy link
Contributor

@tmtmtmtm tmtmtmtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! — though it would be good to move this at least one step towards where we're trying to get all the scrapers to (with separate data-gathering / data-storing steps), rather than pushing the old approach an extra step in the wrong direction.

I think there's also going to be a slightly subtle problem here as well in that this change is essentially switching the scraper from fetching current members, to fetching all members. But as it doesn't pay attention to start and end dates, it's going to end up with data that's incorrect (i.e. including people as currently active who aren't.)

It's still an improvement over what we currently have, so if you plan to handle that in a separate PR, I'm OK with that. But if this is something you just hadn't noticed, it might be worth trying to also add fields for those.

scrape_list(page.next_page) unless page.next_page.to_s.empty?
end

def scrape_person(url)
data = scrape(url => MemberPage).to_h
def scrape_person(url, term:)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a little bit clumsy to need to pass the 'term' around here, when the only reason it's needed is for combining with the data that this method is gathering. The method is already doing two things, so this is largely an inherited problem, but I think this is the time to rectify that, rather than pushing it even further in the wrong direction. So I think it would be much cleaner to take this opportunity to turn this into a person_data method which simply returns the scraped data for the person, and hoist the save_sqlite out of this method.

(The ideal would be to split the save_sqlite out entirely from the data-gathering, so there's just a single delete and re-save after all the scraping, thus ensuring the data doesn't end up in a half-baked state if something fails half-way through, as per current conventions, but that's not quite as necessary here.)

@tmtmtmtm tmtmtmtm assigned chrismytton and unassigned tmtmtmtm Jul 11, 2017
@tmtmtmtm
Copy link
Contributor

tmtmtmtm commented Jul 11, 2017

The change in the test here is also a little bit out of place, as it's not really connected to this change, but is really just reflecting the site's switch to https everywhere. I've factored that out as #6

Rather than scraping all people in the `national-assembly` organization
this instead switches to the /position/ route which allows you to
specify a term to scrape.

This also means we can avoid hard-coding the term in MemberPage, since
we can now parse it from the URL.
Now that we can do term-specific scraping we can add term 25 to the list
of terms that we're interested in.
@jacksonj04
Copy link
Contributor

Closed in favour of #15.

@jacksonj04 jacksonj04 closed this Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants