
Source Code Explanation


This is a full write-up of the source code. I worked pretty hard on this whole thing, so it should be a good learning resource for others as well.

This was made on 4/6/2022, in reference to release v1.0.0

1. grabber.py

  • Lines 1-14 are just importing libraries. Basic stuff.
  • Line 16 references the remote Selenium client from the .env file. This is just the URL of a remote Selenium instance, in case you choose to run on a server. On the server that hosts the website, everything runs in Docker, so I just use the Selenium Grid container.
  • Lines 20-28 set up the Selenium Chrome browser. Here we tell Selenium which webdriver to use and which options to apply. I have also included the latest uBlock Origin adblocker, which is necessary when visiting (not really) unsafe websites.
  • From here the code gets pretty convoluted, so I will only call out specific lines when necessary. Next, beautifulsoup4 takes over. We start by opening the Input Data.txt file, which holds (in order): the assigned site name, which can be anything; the container tag where the links are stored on the website, usually a ul element (Fitgirl-repacks, for example, has a container tag of ul); the class tag, which for Fitgirl-repacks would be lcp_catlist and is what actually gets us our links; the html, which is the full link of the listing page, so for Fitgirl-repacks it would be https://fitgirl-repacks.site/all-my-repacks-a-z/; and finally the domain, which for Fitgirl-repacks would just be https://fitgirl-repacks.site/ (it only references the top-level domain).
  • It then searches through each of those container elements for a elements, which in HTML are links.
  • Once it finds that there are no more a elements, it moves to the next page, whose element is also referenced in Index Data.txt. For Fitgirl-repacks, that element would be lcp_nextlink. Once it finds that there are no more "next pages", it moves on to the next site and so on, collecting links from the sites and storing them. Along the way it checks for popups on the site (another element listed in Index Data.txt) and closes them.
  • Finally, it outputs everything to a json file to be used by the next script. Be warned: if the program is stopped halfway, nothing will be saved; the links are only written to the json at the end, to save on writes.
  • It then closes the webdriver to save money and runs the next script, which is the cleaner. A rough sketch of the whole flow follows this list.
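
To make the steps above concrete, here is a minimal sketch of the grabber flow. The variable names, the output filename, and the idea of hard-coding the Input Data.txt fields in a list are assumptions for illustration; the real grabber.py in v1.0.0 reads them from the file and also handles popups.

```python
# Minimal sketch of the grabber flow described above (not the exact v1.0.0 code).
import json
import os

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

remote_url = os.getenv("SELENIUM_REMOTE")  # remote Selenium client from the .env file
options = Options()
# options.add_extension("ublock_origin.crx")  # the uBlock Origin adblocker mentioned above

driver = (webdriver.Remote(command_executor=remote_url, options=options)
          if remote_url else webdriver.Chrome(options=options))

results = {}
# Each entry mirrors the Input Data.txt fields: site name, container tag,
# class tag, full listing URL, domain, and next-page class.
sites = [("fitgirl", "ul", "lcp_catlist",
          "https://fitgirl-repacks.site/all-my-repacks-a-z/",
          "https://fitgirl-repacks.site/", "lcp_nextlink")]

for name, container, cls, url, domain, next_cls in sites:
    links = []
    while url:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        block = soup.find(container, class_=cls)
        if block:
            # Every a element inside the container is a link we want to keep
            links += [a.get("href") for a in block.find_all("a") if a.get("href")]
        # Follow the "next page" link until there isn't one
        nxt = soup.find("a", class_=next_cls)
        url = nxt.get("href") if nxt else None
    results[name] = links

driver.quit()  # close the webdriver when done

# Links are only written once, at the very end, to save on writes
with open("links.json", "w") as f:
    json.dump(results, f, indent=2)
```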

2. cleaner.py

  • This script is a lot simpler. It is responsible for removing any link we want from the json file. On line 1 all we do is import the json library, as that is all we will use for this one, and on the next lines it opens the json file we just saved with all the links in it.
  • Lines 8-17 are the first solution I came up with; it only handles one link total, so if you want more than one link removed, lines 17-27 are for you.
  • Basically, all it does is read the keys and values you give it, and if it finds them in the json file, it deletes them. The syntax for this is basic json: 'key1':['target1','target2'],'key2':['target1','target2']. For example, if you want to remove https://masquerade.site#a-z-listing-1 from the list, it would look like 'marked':['https://masquerade.site#a-z-listing-1'],...,...
  • If it doesn't find the keys, it just tells you and moves on to save the json file. If it does find them, it deletes them, saves the json file, and finally runs the next script, which is the search formatter. A small sketch of this removal step follows this list.
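
As a rough illustration of the multi-link version, here is a sketch of the removal step. The filename links.json and the exact shape of the removal mapping are assumptions; the real cleaner.py may structure this differently.

```python
# Sketch of the "remove these links" step described above (illustrative only).
import json

with open("links.json") as f:
    data = json.load(f)

# Keys are the json keys, values are the targets to drop from each key's list
to_remove = {
    "marked": ["https://masquerade.site#a-z-listing-1"],
}

for key, targets in to_remove.items():
    if key not in data:
        print(f"Key {key!r} not found, moving on")  # just tell you and move on
        continue
    for target in targets:
        if target in data[key]:
            data[key].remove(target)
        else:
            print(f"{target!r} not found under {key!r}")

with open("links.json", "w") as f:
    json.dump(data, f, indent=2)
```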

3. forsearch.py

  • This script gets the cleaned json file ready to be searched through. Forsearch.py's job is to make our json file compatible with Meilisearch. According to Meilisearch docs:

"A document is an object composed of one or more fields. Each field consists of an attribute and its associated value." document format

  • In line 22 we specify that we want a random 10-character string id assigned to each link.
  • So in line 26 we say that for each link we want to generate one of those random 10-character ids, give the link a name by splitting it on common separators and removing the parts that make it look like a link, and then format everything into clean json to be read by Meilisearch. This isn't that complex, as it is just formatting and generating random ids. A sketch of this formatting step follows this list.
  • It then saves the formatted json and runs the final script which sends everything to the server.
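
Here is a rough sketch of what that formatting could look like. The id alphabet, the way the name is derived from the URL, and the filenames are assumptions for illustration, not the exact v1.0.0 code.

```python
# Sketch of turning the cleaned link list into Meilisearch-ready documents.
import json
import random
import string

def random_id(length=10):
    # Random 10-character string id, as described above
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

with open("links.json") as f:
    data = json.load(f)

documents = []
for site, links in data.items():
    for link in links:
        # Derive a readable name by splitting on common URL separators and
        # dropping the scheme/domain parts that make it "look like a link"
        slug = link.rstrip("/").split("/")[-1]
        name = slug.replace("-", " ").replace("_", " ")
        documents.append({"id": random_id(), "name": name, "link": link})

with open("search.json", "w") as f:
    json.dump(documents, f, indent=2)
```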

4. sendtosearch.py

  • This script is responsible for sending all the data to the server: it pushes the final json file to the Meilisearch API.
  • Within about 10 lines the whole thing is sent and finished, including everything for the search API, which is super easy, and I love Meilisearch for making it that easy. We just get the json, delete the old documents, and send the new ones out. Easy-peasy. A sketch of this step follows below.
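
A minimal sketch of that upload step, assuming the official meilisearch Python client; the host, API key, and index name ("links") are placeholders, not the project's real values.

```python
# Sketch of pushing the formatted documents to Meilisearch (placeholders throughout).
import json
import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700", "masterKey")
index = client.index("links")

with open("search.json") as f:
    documents = json.load(f)

index.delete_all_documents()    # drop the old documents
index.add_documents(documents)  # send the new batch out
```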

And that's everything! It was really fun to work on this project, and thank you everyone!