re-create all nautilus ZIMs #999

Open
rgaudin opened this issue May 14, 2024 · 5 comments

@rgaudin
Member

rgaudin commented May 14, 2024

All nautilus ZIMs would benefit from being re-run:

  • update/fix metadata (several names are incorrect)
  • update/fix collection JSON: with the new checks, previously valid schedules will probably fail
  • all videos must be re-encoded using the latest preset
  • benefit from the tiny fixes of the latest release (the template bug)
  • get ready for the upcoming UI makeover

Now that nautilus supports URL entries, we may want to switch to URL-based collections and drop the ZIP archive. The advantage is that all files are individually available and replaceable; collections are easy to extend.
On the other hand, it makes it difficult to download a full recipe's data and run nautilus locally. @benoit74 WDYT?
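
To make the trade-off concrete, here is a rough sketch of what turning ZIP-member entries into URL entries could look like. It assumes collection entries list their files either as bare strings or as objects with an `archive-member` key, and that URL entries use `url` and `filename` keys; these key names should be checked against the current nautilus collection schema. The `to_url_entries` helper and the base URL are hypothetical.

```python
import json
from urllib.parse import quote


def to_url_entries(collection, base_url):
    """Return a copy of `collection` with archive members turned into URL entries.

    Assumption: files are bare strings or dicts with an "archive-member" key,
    and URL entries are dicts with "url" and "filename" keys.
    """
    base = base_url.rstrip("/")
    converted = []
    for document in collection:
        files = []
        for entry in document.get("files", []):
            if isinstance(entry, str):
                # a bare string is treated as a ZIP-member path
                entry = {"archive-member": entry}
            if "archive-member" in entry:
                member = entry["archive-member"]
                files.append(
                    {
                        # percent-encode the member path, keeping "/" separators
                        "url": f"{base}/{quote(member)}",
                        # keep the display name, defaulting to the member's basename
                        "filename": entry.get("filename", member.rsplit("/", 1)[-1]),
                    }
                )
            else:
                files.append(dict(entry))  # already a URL entry, copied as-is
        converted.append({**document, "files": files})
    return converted


demo = [
    {
        "title": "Lesson 1",
        "files": [
            "videos/lesson 1.webm",
            {"archive-member": "docs/sheet.pdf", "filename": "Sheet.pdf"},
        ],
    }
]
# hypothetical base URL where the expanded ZIP would live
print(json.dumps(to_url_entries(demo, "https://drive.example.org/demo/"), indent=2))
```

This only rewrites the JSON; it presumes the ZIP has already been expanded under the base URL with the same relative paths.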

Here's the list of all nautilus recipes (zimfarm has no filter for it):

| schedule | library | archive |
|---|---|---|
| laboh | laboh_fr | http://download.kiwix.org/other/bayard/laboh.zip |
| maitre_lucas_additions_20 | maitre_lucas_additions_20_fr | https://drive.farm.openzim.org/maitrelucas/Additions%20et%20soustractions%20jusqu_a%CC%80%2020.zip |
| maitre_lucas_completer_les_algorithmes | maitre_lucas_completer_algorithmes_fr | https://drive.farm.openzim.org/maitrelucas/Comple%CC%81ter%20les%20algorithmes.zip |
| maitre_lucas_completer_les_mots | maitre_lucas_completer_les_mots_fr | https://drive.farm.openzim.org/maitrelucas/Comple%CC%81ter%20les%20mots.zip |
| maitre_lucas_compter_audela_de_20 | maitre_lucas_compter_audela_de_20_fr | https://drive.farm.openzim.org/maitrelucas/Compter%20au%20dela%CC%80%20de%2020.zip |
| maitre_lucas_compter_jusque_10 | maitre_lucas_compter_jusque_10_fr | https://drive.farm.openzim.org/maitrelucas/Compter%20jusqu_a%CC%80%2010.zip |
| maitre_lucas_compter_jusque_5 | maitre_lucas_compter_jusque_5_fr | https://drive.farm.openzim.org/maitrelucas/Compter%20jusqu_a%CC%80%205.zip |
| maitre_lucas_confusion_des_lettres | maitre_lucas_confusion_des_lettres_fr | https://drive.farm.openzim.org/maitrelucas/Confusion%20des%20lettres%20p-q%2C%20b-d.zip |
| maitre_lucas_developpement | maitre_lucas_developpement_fr | https://drive.farm.openzim.org/maitrelucas/De%CC%81veloppement%20des%20e%CC%82tres%20vivants.zip |
| maitre_lucas_enseignement_civique | maitre_lucas_enseignement_civique_fr | https://drive.farm.openzim.org/maitrelucas/Enseignement%20civique%20et%20moral.zip |
| maitre_lucas_espace_geometrie | maitre_lucas_espace_geometrie_fr | https://drive.farm.openzim.org/maitrelucas/Espace%20et%20ge%CC%81ome%CC%81trie.zip |
| maitre_lucas_labyrinthe_de_calculs | maitre_lucas_labyrinthe_de_calculs_fr | https://drive.farm.openzim.org/maitrelucas/Labyrinthes%20de%20calculs.zip |
| maitre_lucas_planete_terre | maitre_lucas_planete_terre_fr | https://drive.farm.openzim.org/maitrelucas/La%20plane%CC%80te%20Terre%20et%20l_environnement.zip |
| youscribe_college | youscribe_fr_college | https://drive.farm.openzim.org/youscribe_college%20/youscribe_college.zip |
| youscribe_lycee | youscribe_fr_lycee | https://drive.farm.openzim.org/youscribe_lycee%20/youscribe_lycee.zip |
| japprendsalire | japprendsalire_fr | https://drive.farm.openzim.org/japprendsalire/japprendsalire.zip |
| lesbelleshistoires | lesbelleshistoires_fr | https://drive.farm.openzim.org/lesbelleshistoires/lesbelleshistoires.zip |
| mesptitesquestions | mesptitesquestions_fr | https://drive.farm.openzim.org/mesptitesquestions/mesptitesquestions.zip |
| experiencesscientifiques | experiencesscientifiques_fr | https://drive.farm.openzim.org/experiencesscientifiques/experiencesscientifiques.zip |
| scoopyendirectducorpshumain | scoopyendirectducorpshumain_fr | https://drive.farm.openzim.org/scoopyendirectducorpshumain/scoopyendirectducorpshumain.zip |
| diksha-std10ssc-marathi | diksha-std10ssc_mr | https://drive.farm.openzim.org/zaya/std-10-ssc-marathi.zip |
| bayardcuisine | bayardcuisine_fr | https://drive.farm.openzim.org/bayardcuisine/bayardcuisine.zip |
| pink_pookie | Ressources_pedagogiques_relatives_au_droit_auteur | https://drive.farm.openzim.org/pink_pookie/pinkpookie.zip |
| jaimelire | jaimelire_fr | https://drive.farm.openzim.org/jaimelire/jaimelire.zip |
| prunelle_draw_your_african_story | prunelle_draw_your_african_story_en | https://drive.farm.openzim.org/prunelle/draw_your_african_story.zip |
| prunelle_auteurs_en_herbe | prunelle_auteurs_en_herbe_fr | https://drive.farm.openzim.org/prunelle/auteurs_en_herbe.zip |
| maitre_lucas_alimentation | maitre_lucas_alimentation_fr | https://drive.farm.openzim.org/maitrelucas/Alimentation.zip |
| maitre_lucas_additions | maitre_lucas_additions_fr | https://drive.farm.openzim.org/maitrelucas/Additions.zip |
| maitre_lucas_calcul_mental | maitre_lucas_calcul_mental_fr | https://drive.farm.openzim.org/maitrelucas/Calcul%20mental.zip |
| maitre_lucas_conjugaison | maitre_lucas_conjugaison_fr | https://drive.farm.openzim.org/maitrelucas/Conjugaison.zip |
| maitre_lucas_dictees | maitre_lucas_dictees_fr | https://drive.farm.openzim.org/maitrelucas/Dicte%CC%81es.zip |
| maitre_lucas_divisions | maitre_lucas_divisions_fr | https://drive.farm.openzim.org/maitrelucas/Divisions.zip |
| maitre_lucas_double_et_moitie | maitre_lucas_double_et_moitie_fr | https://drive.farm.openzim.org/maitrelucas/Double%20et%20moitie%CC%81.zip |
| maitre_lucas_ecriture | maitre_lucas_ecriture_fr | https://drive.farm.openzim.org/maitrelucas/Ecriture.zip |
| maitre_lucas_enigmes | maitre_lucas_enigmes_fr | https://drive.farm.openzim.org/maitrelucas/Enigmes.zip |
| maitre_lucas_fractions | maitre_lucas_fractions_fr | https://drive.farm.openzim.org/maitrelucas/fractions.zip |
| maitre_lucas_grammaire | maitre_lucas_grammaire_fr | https://drive.farm.openzim.org/maitrelucas/Grammaire.zip |
| maitre_lucas_lecture | maitre_lucas_lecture_fr | https://drive.farm.openzim.org/maitrelucas/Lecture.zip |
| zaya-english-duniya-marathi | zaya-english-duniya-marathi_mr | https://drive.offspot.it/zaya/zaya-s-english-duniya-marathi.zip |
| maitre_lucas_compter_par_intervalles | maitre_lucas_compter_par_intervalles_fr | https://drive.farm.openzim.org/maitrelucas/Compter%20par%20intervalles%20re%CC%81guliers.zip |
| maitre_lucas_comparer_et_ranger | maitre_lucas_comparer_et_ranger_fr | https://drive.farm.openzim.org/maitrelucas/Comparer%20et%20ranger.zip |
| maitre_lucas_grandeur_et_mesures | maitre_lucas_grandeur_et_mesures_fr | https://drive.farm.openzim.org/maitrelucas/Grandeurs%20et%20mesures.zip |
| maitre_lucas_dictees_de_nombres | maitre_lucas_dictees_de_nombres_fr | https://drive.farm.openzim.org/maitrelucas/Dicte%CC%81es%20de%20nombres.zip |
| maitre_lucas_calculs_et_coloriages | maitre_lucas_calculs_et_coloriages_magiques_fr | https://drive.farm.openzim.org/maitrelucas/Calculs%20et%20Coloriages%20magiques.zip |
| prunelle_contes_africains | prunelle_contes_africains_fr | https://drive.farm.openzim.org/prunelle/contes_africains_a_illustrer.zip |
| maitre_lucas_corps_humain | maitre_lucas_corps_humain_fr | https://drive.farm.openzim.org/maitrelucas/Corps%20humain%20et%20activite%CC%81.zip |
| maitre_lucas_compter_jusque_20 | maitre_lucas_compter_jusque_20_fr | https://drive.farm.openzim.org/maitrelucas/Compter%20jusqu_a%CC%80%2020.zip |
| prunelle_interactive_books | prunelle_interactive_books_en | https://drive.farm.openzim.org/prunelle/prunelle_interactive_books.zip |
| prunelle_budding_authors | Prunelle_budding_authors_en | https://drive.farm.openzim.org/prunelle/budding_authors.zip |
| zimgit-food-preparation_en | zimgit-food-preparation_en | https://drive.farm.openzim.org/zimgit-food-preparation/zimgit-food-preparation.zip |
| zimgit-post-disaster_en | zimgit-post-disaster_en | https://drive.farm.openzim.org/zimgit-post-disaster_en/zimgit.zip |
| editions-ganndal_fr_fo-livres | editions-ganndal_fr_fo-livres | https://drive.farm.openzim.org/ganndal/ganndal_2024-03.zip |
| prunelle_livres_interactifs | prunelle_livres_interactifs_fr | https://drive.farm.openzim.org/prunelle/livres_interactifs_prunelle.zip |
| maitre_lucas_apprendre_a_dessiner | maitre_lucas_apprendre_a_dessiner_fr | https://drive.farm.openzim.org/maitrelucas/Apprendre%20a%CC%80%20dessiner.zip |
| youscribe_audiobooks | youscribe_fr_audiobooks | https://drive.farm.openzim.org/youscribe_audiobooks/youscribe_audiobooks.zip |
| zimgit-knots_en | zimgit-knots_en | https://drive.farm.openzim.org/zimgit-knots/zimgit-knots.zip |
| mesptitspourquoi | mesptitspourquoi_fr | https://drive.farm.openzim.org/mesptitspourquoi/mesptitspourquoi.zip |
| zimgit-water_en | zimgit-water_en | https://drive.farm.openzim.org/zimgit-water/zimgit-water.zip |
| alittlequestionaday | alittlequestionaday_en | https://drive.farm.openzim.org/alittlequestionaday/alittlequestionaday.zip |
| diksha-std5ssc-english | diksha-std5ssc_en | https://drive.farm.openzim.org/zaya/std-5-ssc-english.zip |
| storybox | storybox_en | https://drive.farm.openzim.org/storybox/storybox.zip |
| maitre_lucas_comparer_nombres_100 | maitre_lucas_comparer_nombres_100_fr | https://drive.farm.openzim.org/maitrelucas/Comparer%20les%20nombres%20jusqu_a%CC%80%20100.zip |
| Terra_x_de | Terra_x_de | https://commons.wikimedia.org/wiki/Category:Videos_by_Terra_X |
| zimgit-medicine_en | zimgit-medicine_en | https://drive.farm.openzim.org/zimgit-medicine/zimgit-medicine.zip |
| maitre_lucas_calcul_decimaux | maitre_lucas_calcul_decimaux_fr | https://drive.farm.openzim.org/maitrelucas/Calculs%20avec%20nombres%20de%CC%81cimaux.zip |
| disledansmalangue | disledansmalangue_fr | https://drive.farm.openzim.org/disledansmalangue/disledansmalangue.zip |
| youscribe_primaire | youscribe_fr_primaire | https://drive.farm.openzim.org/youscribe_primaire%20/youscribe_primaire.zip |
| lesptitsphilosophes | lesptitsphilosophes_fr | http://download.kiwix.org/other/bayard/lesptitsphilosophes.zip |
| poesies | poesies_fr | https://drive.farm.openzim.org/poesies/poesies.zip |

@rgaudin added the "ZIM Update" (Updating existing ZIM files) label May 14, 2024

@benoit74
Contributor

As discussed live, I agree that we should expand the ZIPs on the drive, re-encode the videos, create a JSON with all individual file URLs, and update the recipes. This is a task for a developer (probably me), since it is too cumbersome and error-prone to do by hand.

@rgaudin
Member Author

rgaudin commented May 14, 2024

Indeed.

FYI, here is a sample re-encode script that can be run on the drive root:

import argparse
import logging
import pathlib
import sys

import humanfriendly
from zimscraperlib.video.encoding import reencode
from zimscraperlib.video.presets import VideoWebmLow

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
ROOT = pathlib.Path(__file__).parent


def disk_usage(folder):
    # only count regular files; directory entries would skew the total
    return sum(file.stat().st_size for file in folder.rglob("*") if file.is_file())


def hsize(size):
    return humanfriendly.format_size(size, binary=True)


def main(root: pathlib.Path):

    du = disk_usage(root)
    logger.info(f"re-encoding videos from {root} ({hsize(du)})")

    ffmpeg_args = VideoWebmLow().to_ffmpeg_args()

    errored = []
    for video_fpath in root.rglob("*.webm"):
        logger.info(f"** {video_fpath}")
        if reencode(
            src_path=video_fpath,
            dst_path=video_fpath,
            ffmpeg_args=ffmpeg_args,
            delete_src=True,
            with_process=False,
            failsafe=True,
        ):
            logger.info("  OK")
        else:
            logger.error("  ERROR")
            errored.append(video_fpath)

    final_du = disk_usage(root)
    logger.info(f"new disk-usage: {hsize(final_du)} (diff: {hsize(final_du - du)})")

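    if not errored:
        logger.info("ALL OK")
        return 0

    # paths must be converted to str before joining
    failed = "\n- ".join(str(fpath) for fpath in errored)
    logger.error(f"{len(errored)} files failed to re-encode:\n- {failed}")
    return 1  # non-zero exit code so callers can detect failures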


def entrypoint():
    parser = argparse.ArgumentParser(
        prog="re-encode",
        description="re-encode videos using scraperlib",
    )

    parser.add_argument(
        "src_path",
        help="Root folder to search for webm videos",
    )

    args = parser.parse_args()

    try:
        sys.exit(main(pathlib.Path(args.src_path).expanduser().resolve()))
    except Exception as exc:
        logger.error(f"FAILED. An error occurred: {exc}")
        logger.exception(exc)
        raise SystemExit(1) from exc


if __name__ == "__main__":
    entrypoint()

@kelson42
Collaborator

kelson42 commented May 14, 2024

Can we please just do that (redoing the ZIM files) programmatically? This is a priority. The rest should be handled separately, and I'm not in favour of rewriting the ZIP unless really necessary; see for example openzim/nautilus#23

@rgaudin
Member Author

rgaudin commented May 14, 2024

Following live discussion:

  • All current recipes are to be scheduled ASAP. Those that end up working will create a slightly better ZIM than what exists. Those that don't will provide feedback on what to fix.
  • This will likely require some JSON collection fixes: ZIPs created on Windows with accented filenames inside often end up with broken ZIP-member names, as Python's zipfile lib decodes the index as UTF-8.
  • The content team may open a dedicated ticket to fix the metadata (mostly Name, I think).
  • @benoit74 will expand all ZIPs and re-encode all webm files.
  • He'll fix all JSON collections to turn ZIP entries into URL entries.
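
For the broken ZIP-member names, a best-effort repair could look like the sketch below. To my knowledge, `zipfile` decodes a member name as UTF-8 only when the archive sets the UTF-8 flag (general-purpose bit 11) and as cp437 otherwise, so undoing the cp437 decoding recovers the raw bytes. The `repaired_name` and `list_repaired` helpers are hypothetical, and the `legacy_encoding` to try is an assumption that depends on the machine that created the ZIP.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: member names are UTF-8


def repaired_name(info, legacy_encoding="cp437"):
    """Best-effort recovery of a ZIP member name (hypothetical helper)."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # flagged as UTF-8, trust it
    # undo zipfile's cp437 decoding to get the raw bytes back
    raw = info.filename.encode("cp437")
    try:
        # unflagged archives sometimes still contain UTF-8 names
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # assumption: the caller knows (or guesses) the creator's code page
        return raw.decode(legacy_encoding)


def list_repaired(zip_path, legacy_encoding="cp437"):
    """Map each name as read by zipfile to its repaired candidate."""
    with zipfile.ZipFile(zip_path) as archive:
        return {
            info.filename: repaired_name(info, legacy_encoding)
            for info in archive.infolist()
        }
```

The mapping can then be used to correct the member names referenced in the collection JSON. Python 3.11+ also accepts a `metadata_encoding` argument on `zipfile.ZipFile` for reading unflagged names, which may be simpler when the encoding is known.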

It is understood that we'll re-encode these because we know they are broken webm files and because we no longer have the source videos. In a normal situation, we'd store the source video on the drive and the (yet to be implemented) nautilus-included encoder would optimize it.

@kelson42
Collaborator

kelson42 commented May 25, 2024

I have rescheduled all 47 recipes based on the "nautilus" tag and, after fixing a few (Nautilus 1.2 has a better Metadata conformity check), they have all passed.


3 participants