Data Pre-processing #2

Open
1 of 4 tasks
govvin opened this issue Feb 20, 2022 · 3 comments

@govvin
Member

govvin commented Feb 20, 2022

  • @govvin : subdivide the dataset so that other volunteers can work concurrently on the RBI dataset clean-up (normalization, typo fixes, standard name styles, etc.), using the bridge ID as the common key
  • Clean-up bridge names. Tasks are in another comment below.
  • @govvin : ☑️ clean up road names
  • @govvin : combine, review, and finalize the complete dataset to be used for the mapping challenge
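
A minimal sketch of the split-and-merge workflow implied by the first and last tasks above. It assumes each record is a plain object with hypothetical `bridge_id` and `province` fields; the actual column names in the RBI dataset may differ, and `splitByProvince` and `mergeCleaned` are illustrative names only.

```javascript
// Hypothetical record shape: { bridge_id, province, name }.
// These field names are assumptions, not taken from the RBI dataset.

// Group records by province so each volunteer can take one chunk.
function splitByProvince(records) {
    const chunks = new Map();
    for (const record of records) {
        if (!chunks.has(record.province)) chunks.set(record.province, []);
        chunks.get(record.province).push(record);
    }
    return chunks;
}

// Merge cleaned chunks back into the full dataset, matching on bridge_id.
// Records without a cleaned counterpart are kept unchanged.
function mergeCleaned(original, cleanedChunks) {
    const cleanedById = new Map();
    for (const chunk of cleanedChunks) {
        for (const record of chunk) cleanedById.set(record.bridge_id, record);
    }
    return original.map(record => cleanedById.get(record.bridge_id) ?? record);
}

// Made-up sample data for illustration.
const records = [
    { bridge_id: 'B001', province: 'ABR', name: 'Sample Br.' },
    { bridge_id: 'B002', province: 'AKL', name: 'Other Bridge' },
];
const chunks = splitByProvince(records);
const merged = mergeCleaned(records, [
    [{ bridge_id: 'B001', province: 'ABR', name: 'Sample Bridge' }],
]);
```

Keying the merge on the bridge ID means volunteers can reorder or re-sort rows in their editor without breaking the final combine step.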

Formatting and Style Guide

See https://github.com/OSMPH/dpwh_bridges/wiki/Processing/_edit#formating-and-style-guidelines-aka-style-manual

@govvin
Member Author

govvin commented Feb 20, 2022

Province-level clean-up tasks:

To do

  • Review bridge names and fix issues (e.g. spell out abbreviations, fix typos, etc.).
  • Review the BRGY, MUN values again. They've been given a quick pass, and another pass is helpful.
    • Some BRGY values are missing. Try to identify them, if you can.
    • Records are grouped by province, and in most cases, it's possible to finish each in one sitting.
  • Please finish your selected task before selecting another task. Do not "reserve" tasks.
  • Put a tick mark on your selected province and download the corresponding file from the pre-processing folder. Use your favorite editor for the data clean-up.
  • Once you complete the task, upload the file via a pull request or through the Telegram channel. See the readme.md

Tasks:

  • ABR - Abra
  • AGN - Agusan del Norte
  • AGS - Agusan del Sur
  • AKL - Aklan
  • ALB - Albay
  • ANT - Antique
  • APA - Apayao
  • AUR - Aurora
  • BAS - Basilan
  • BAN - Bataan
  • BTN - Batanes
  • BTG - Batangas
  • BEN - Benguet
  • BIL - Biliran
  • BOH - Bohol
  • BUK - Bukidnon
  • BUL - Bulacan
  • CAG - Cagayan
  • CAN - Camarines Norte
  • CAS - Camarines Sur
  • CAM - Camiguin
  • CAP - Capiz
  • CAT - Catanduanes
  • CAV - Cavite
  • CEB - Cebu
  • NCO - Cotabato
  • COM - Davao de Oro
  • DAV - Davao del Norte
  • DAS - Davao del Sur
  • DVO - Davao Occidental
  • DAO - Davao Oriental
  • DIN - Dinagat Islands
  • EAS - Eastern Samar
  • GUI - Guimaras
  • IFU - Ifugao
  • ILN - Ilocos Norte
  • ILS - Ilocos Sur
  • ILI - Iloilo
  • ISA - Isabela
  • KAL - Kalinga
  • LUN - La Union
  • LAG - Laguna
  • LAN - Lanao del Norte
  • ❌ LAS - Lanao del Sur. No data records found.
  • LEY - Leyte
  • MAG - Maguindanao
  • MAD - Marinduque
  • MAS - Masbate
  • MDC - Mindoro Occidental
  • MDR - Mindoro Oriental
  • MSC - Misamis Occidental
  • MSR - Misamis Oriental
  • MOU - Mountain Province
  • NEC - Negros Occidental
  • NER - Negros Oriental
  • NSA - Northern Samar
  • NUE - Nueva Ecija
  • NUV - Nueva Vizcaya
  • PLW - Palawan
  • PAM - Pampanga
  • PAN - Pangasinan
  • QUE - Quezon
  • QUI - Quirino
  • RIZ - Rizal
  • ROM - Romblon
  • WSA - Samar
  • SAR - Sarangani
  • SIG - Siquijor
  • SOR - Sorsogon
  • SCO - South Cotabato
  • SLE - Southern Leyte
  • SUK - Sultan Kudarat
  • ❌ SLU - Sulu. No data records found.
  • SUN - Surigao del Norte
  • SUR - Surigao del Sur
  • TAR - Tarlac
  • ❌ TAW - Tawi-Tawi. No data records found.
  • ZMB - Zambales
  • ZAN - Zamboanga del Norte
  • ZAS - Zamboanga del Sur
  • ZSI - Zamboanga Sibugay

@mdgabriel1
Contributor

mdgabriel1 commented Feb 26, 2022

Hi @govvin ,
Completed checking the following provinces:

  • ABR - Abra
  • AGN - Agusan del Norte
  • AGS - Agusan del Sur
  • AKL - Aklan
  • ALB - Albay
  • ANT - Antique
  • APA - Apayao
  • AUR - Aurora
  • BAN - Bataan
  • BTG - Batangas
  • BIL - Biliran
  • BTN - Batanes
  • BUK - Bukidnon
  • BUL - Bulacan

@thiscaspar

thiscaspar commented Oct 2, 2022

I used a script to do some mass clean-up (see my fork).

The result is in a Google Sheet for easy review: https://docs.google.com/spreadsheets/d/1tPG7NJx7EuEXjY7HCg8oY9Cyqwz1H0JrXsSf5dyE_k0/edit#gid=893538581

Barangays and municipalities are mostly fixed automatically. To give you an idea, this is the code:

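// Expand abbreviations in a bridge name (Br., NB/SB/WB/EB, Gov.) and insert "№" before its number.
function cleanName(str) {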
    return str
        .replaceAll('Br.', 'Bridge')
        .replace('(NB)', ' (Northbound)')
        .replace(' NB)', ' Northbound)')
        .replace('NB)', ' Northbound)')
        .replace(' NB ', ' Northbound ')
        .replace('(SB)', '(Southbound)')
        .replace(' SB ', ' Southbound ')
        .replace('(WB)', '(Westbound)')
        .replace(' WB ', ' Westbound ')
        .replace('(EB)', '(Eastbound)')
        .replace(' EB ', ' Eastbound ')
        .replace('Gov. ', 'Governor ')
        .replace('Arch Reyes', 'Archbishop Reyes')
        .replace(/(.*)( \d)/g, "$1 №$2")
        .replace(/ {2,}/g, ' ') // collapse every run of spaces, not only the first one found
        .replace('( ', '(')
}

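// Normalize whitespace and capitalization, then fix known misspellings and strip province suffixes.
function cleanMunicipality(str) {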
    return str
        .replace(/\s+/g, ' ').trim() // collapse whitespace runs and strip both ends
        .toLowerCase()
        .split(' ')
        .map(word => word.charAt(0).toUpperCase() + word.substring(1))
        .join(' ')
        .replace('Sta.', 'Santa')
        .replace('Sta ', 'Santa ')
        .replace('Zambonga City', 'Zamboanga City')
        .replace("Brookes Point", "Brooke's Point")
        .replace("Brook's Point", "Brooke's Point")
        .replace("Busuanga, Palawan", "Busuanga")
        .replace(", Cebu", "")
        .replace(", Sorsogon City", "")
        .replace(", Ilocos Norte", "")
        .replace(", Rizal", "")
        .replace(",lanao Del Norte", "")
        .replace(", Lanao Del Norte", "")
        .replace(", Agusan Del Sur", "")
        .replace(", Province Of Dinagat Islands", "")
        .replace(", Leyte", "")
        .replace(", N. Samar", "")
        .replace(", Quezon", "")
        .replace(",zds.", "")
        .replace(", Cam. Sur", "")
        .replace(", N Samar", "")
        .replace(",capiz", "")
        .replace(", Ilocos Sur", "")
        .replace(",zamboanga Del Sur", "")
        .replace(", Zamboanga Del Sur", "")
        .replace(" ,tarlac", "")
        .replace(", Albay", "")
        .replace(", Palawan", "")
        .replace(", Northern Samar", "")
        .replace(",tarlac", "")
        .replace(", Zds.", "")
        .replace("Sergio Osmena, Sr.", "Sergio Osmeña")
}

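// Normalize whitespace and capitalization, drop the "Barangay" prefix, expand abbreviations, fix known typos.
function cleanBarangay(str) {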
    return str
        .replace(/\s+/g, ' ').trim() // collapse whitespace runs and strip both ends
        .toLowerCase()
        .split(' ')
        .map(word => word.charAt(0).toUpperCase() + word.substring(1))
        .join(' ')
        .replace('Brgy. ', '')
        .replace('Bgy. ', '')
        .replace('Barangay ', '')
        .replace('Sta.', 'Santa')
        .replace('Sta ', 'Santa ')
        .replace('Sto.', 'Santo')
        .replace('Sto ', 'Santo ')
        .replaceAll('Brgys.', 'Barangays')
        .replace('Pob.', 'Poblacion')
        .replace('Pobl;acion', 'Poblacion')
        .replace('Herero-perez', 'Herrero-Perez')
        .replace('New Bususnga', 'New Busuanga')
}
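
One thing to keep in mind with the chains above: `String.prototype.replace` with a string pattern only replaces the first occurrence. A possible alternative, sketched below, keeps the pairs in a table and applies each with `replaceAll`. The helper names `titleCase`, `applyReplacements`, and `cleanBarangayFromTable` are hypothetical, and only a few of the pairs from `cleanBarangay` are copied in.

```javascript
// Collapse whitespace, then title-case each space-separated word,
// mirroring the first steps of the functions above.
function titleCase(str) {
    return str
        .replace(/\s+/g, ' ')
        .trim()
        .toLowerCase()
        .split(' ')
        .map(word => word.charAt(0).toUpperCase() + word.substring(1))
        .join(' ');
}

// Apply each [search, replacement] pair, in order, to every occurrence.
function applyReplacements(str, pairs) {
    return pairs.reduce(
        (acc, [search, replacement]) => acc.replaceAll(search, replacement),
        str
    );
}

// A few pairs copied from cleanBarangay; the full list would go here.
const BARANGAY_REPLACEMENTS = [
    ['Brgy. ', ''],
    ['Bgy. ', ''],
    ['Barangay ', ''],
    ['Sta.', 'Santa'],
    ['Sto.', 'Santo'],
    ['Pob.', 'Poblacion'],
];

function cleanBarangayFromTable(str) {
    return applyReplacements(titleCase(str), BARANGAY_REPLACEMENTS);
}

console.log(cleanBarangayFromTable('  brgy.  sta. cruz ')); // prints "Santa Cruz"
```

With a table, adding a newly discovered typo is a one-line change, and the order of the pairs stays easy to review.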

There are still 69 missing barangays and 19 missing municipalities. Some values are not "clean" yet (they include the region, or have messy formatting for multiple barangays).
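
To find the remaining problem values for manual review, something like the following could help. It is only a rough sketch: the heuristics (the word "Region", or a separator such as `,`, `/`, `&`, `;`, or "and") are my own assumptions about what "not clean" looks like, and `classifyValue` and `summarize` are hypothetical names.

```javascript
// Classify one barangay or municipality value with simple, assumed heuristics.
function classifyValue(value) {
    const text = String(value ?? '').trim();
    if (text === '') return 'missing';
    if (/\bregion\b/i.test(text)) return 'has-region';
    // A separator suggests several names were packed into one field.
    if (/[,\/&;]|\band\b/i.test(text)) return 'multiple';
    return 'ok';
}

// Count how many values fall into each category.
function summarize(values) {
    const counts = { missing: 0, 'has-region': 0, multiple: 0, ok: 0 };
    for (const value of values) counts[classifyValue(value)] += 1;
    return counts;
}

// Made-up sample values for illustration.
const summary = summarize(['', 'Poblacion', 'San Jose & San Juan', null]);
```

Anything not classified as `ok` still needs a human look; the heuristics will produce false positives for legitimate names that contain a comma.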

I stuck to using " № X" for bridge numbering; it seems the cleanest.
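
If the script is ever re-run on already cleaned names, the numbering step would add a second "№". A possible variant that is safe to apply twice is sketched below. `addNumeroSign` is a hypothetical name, and unlike the regex in `cleanName` it only touches a standalone whole number (so a name like "2A" is left alone), which is an assumption about the desired behavior.

```javascript
// Insert " №" before the first standalone number in a bridge name,
// unless a "№" already sits directly before it.
function addNumeroSign(name) {
    // (?<!№)  the space must not be preceded by "№"
    // (\d+)   the bridge number
    // (?!\S)  the number must be followed by whitespace or end of string
    return name.replace(/(?<!№) (\d+)(?!\S)/, ' № $1');
}

const once = addNumeroSign('Sample Bridge 2');
const twice = addNumeroSign(once);
console.log(once);  // prints "Sample Bridge № 2"
console.log(twice); // prints "Sample Bridge № 2"
```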

Hope this helps, let me know if anything needs adjustment.
