Scrape multiple terms #3

chrismytton · 2017-07-10T15:19:28Z

Update the scraper to use the new per-term views added in mysociety/pombola#2287. This allows us to scrape terms 25 and 26, rather than having 26 hard-coded in the scraper.

Part of everypolitician/everypolitician-data#42096

tmtmtmtm

Yay! Though it would be good to move this at least one step towards where we're trying to get all the scrapers to (with separate data-gathering and data-storing steps), rather than pushing the old approach an extra step in the wrong direction.

I think there's also a slightly subtle problem here: this change essentially switches the scraper from fetching current members to fetching all members. But as it doesn't pay attention to start and end dates, it's going to end up with incorrect data (i.e. including people as currently active who aren't).

It's still an improvement over what we currently have, so if you plan to handle that in a separate PR, I'm OK with that. But if this is something you just hadn't noticed, it might be worth trying to also add fields for those.
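To illustrate why the dates matter once all members are fetched, here is a minimal sketch. The field names `start_date` and `end_date`, the `current?` helper, and the sample records are all assumptions made up for illustration; they are not taken from the scraper or the site.

```ruby
require 'date'

# Invented sample data: field names and values are assumptions for illustration only.
MEMBERSHIPS = [
  { name: 'Member A', term: '25', start_date: '2009-05-06', end_date: '2014-05-06' },
  { name: 'Member B', term: '26', start_date: '2014-05-21', end_date: nil },
].freeze

# Hypothetical helper: a membership is current on a given day if it has
# started and has no end date in the past.
def current?(membership, on: Date.today)
  started = Date.parse(membership[:start_date]) <= on
  ended = membership[:end_date] && Date.parse(membership[:end_date]) < on
  started && !ended
end

# Without the date fields both records would look equally "active".
current_names = MEMBERSHIPS
                .select { |m| current?(m, on: Date.new(2017, 7, 11)) }
                .map { |m| m[:name] }
```

With the dates recorded, a consumer of the data can separate former members from current ones instead of treating every scraped person as active.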

tmtmtmtm · 2017-07-11T04:46:20Z

scraper.rb

  scrape_list(page.next_page) unless page.next_page.to_s.empty?
 end

-def scrape_person(url)
-  data = scrape(url => MemberPage).to_h
+def scrape_person(url, term:)


It's a little bit clumsy to need to pass the `term` around here, when the only reason it's needed is to combine it with the data that this method is gathering. The method is already doing two things, so this is largely an inherited problem, but I think this is the time to rectify that, rather than pushing it even further in the wrong direction. So I think it would be much cleaner to take this opportunity to turn this into a `person_data` method which simply returns the scraped data for the person, and hoist the `save_sqlite` out of this method.

(The ideal would be to split the `save_sqlite` out entirely from the data-gathering, so there's just a single delete and re-save after all the scraping. As per current conventions, that ensures the data doesn't end up in a half-baked state if something fails half-way through. But that's not quite as necessary here.)
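A rough sketch of the shape being suggested, under stated assumptions: `person_data` is the name proposed above, but `FAKE_PAGES`, `member_urls`, `gather`, and `store` are hypothetical stand-ins, and an in-memory array replaces the real database so the example runs on its own. In the actual scraper, `person_data` would wrap `scrape(url => MemberPage).to_h` and `store` would do the single delete followed by `save_sqlite`.

```ruby
# Stand-in for the scraped pages; a real scraper would fetch these over HTTP.
FAKE_PAGES = {
  'https://example.org/person/a' => { id: 'a', name: 'Member A' },
  'https://example.org/person/b' => { id: 'b', name: 'Member B' },
}.freeze

# Data-gathering only: returns the scraped hash and knows nothing about
# terms or storage.
def person_data(url)
  FAKE_PAGES.fetch(url)
end

# Hypothetical listing of member URLs for each term.
def member_urls(term)
  {
    '25' => ['https://example.org/person/a'],
    '26' => ['https://example.org/person/a', 'https://example.org/person/b'],
  }.fetch(term)
end

# Gather every row first; the caller merges in the term, so it no longer
# needs to be passed down into person_data.
def gather(terms)
  terms.flat_map do |term|
    member_urls(term).map { |url| person_data(url).merge(term: term) }
  end
end

# Data-storing: one clear and one write, run only after gathering succeeded.
def store(rows, db)
  db.clear
  db.concat(rows)
end

db = []
store(gather(%w[25 26]), db)
```

Because `gather` runs to completion before `store` touches anything, an exception part-way through scraping leaves the previously stored rows intact.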

tmtmtmtm · 2017-07-11T04:58:15Z

The change in the test here is also a little bit out of place: it's not really connected to this change, but just reflects the site's switch to https everywhere. I've factored that out as #6.

Rather than scraping all people in the `national-assembly` organization, this instead switches to the `/position/` route, which allows you to specify a term to scrape. This also means we can avoid hard-coding the term in `MemberPage`, since we can now parse it from the URL.
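As a sketch of what parsing the term from the URL could look like: the URL shape below (a `session` query parameter whose value ends in the term number) and the `term_from_url` helper are assumptions for illustration. The real `/position/` route may encode the term differently.

```ruby
require 'uri'

# Hypothetical helper: pulls the trailing digits out of an assumed `session`
# query parameter. Returns nil when the parameter is absent.
def term_from_url(url)
  query = URI.parse(url).query.to_s
  session = URI.decode_www_form(query).to_h['session']
  session && session[/\d+\z/]
end

# Assumed URL shape, not taken from the site.
term = term_from_url('https://example.org/position/member/?session=na26')
```

Deriving the term this way means each listing URL carries its own term, so adding term 25 only requires adding another URL rather than another hard-coded value.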

Now that we can do term-specific scraping we can add term 25 to the list of terms that we're interested in.

jacksonj04 · 2019-06-07T15:27:31Z

Closed in favour of #15.

chrismytton force-pushed the scrape-multiple-terms branch from 7210e12 to 972ccb4 Compare July 10, 2017 15:24

chrismytton requested a review from tmtmtmtm July 10, 2017 16:52

chrismytton assigned tmtmtmtm Jul 10, 2017

tmtmtmtm suggested changes Jul 11, 2017

tmtmtmtm assigned chrismytton and unassigned tmtmtmtm Jul 11, 2017

chrismytton added 2 commits July 11, 2017 15:40

Scrape National Assembly term 25

ec7ab4a

Now that we can do term-specific scraping we can add term 25 to the list of terms that we're interested in.

chrismytton force-pushed the scrape-multiple-terms branch from 972ccb4 to ec7ab4a Compare July 11, 2017 16:57

chrismytton removed their assignment Nov 27, 2017

jacksonj04 mentioned this pull request Apr 9, 2018

Return all members of 26th National Assembly session #12

Merged

jacksonj04 mentioned this pull request Jun 7, 2019

Add the ability to scrape multiple terms #15

Merged

jacksonj04 closed this Jun 7, 2019