Showing 1 changed file with 174 additions and 0 deletions.
@@ -0,0 +1,174 @@
---
title: "Introduction to geocodebr"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
code-annotations: hover
urlcolor: blue
vignette: >
  %\VignetteIndexEntry{Introduction to geocodebr}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true"),
  out.width = "100%"
)
```

Geocoding is the act of finding a point in space, usually represented by a pair
of coordinates, given a street address. The **geocodebr** package lets you
efficiently geocode Brazilian addresses using the National Registry of
Addresses for Statistical Purposes (*Cadastro Nacional de Endereços para Fins
Estatísticos*, CNEFE)[^1], a data set collected and
[published](https://www.ibge.gov.br/estatisticas/sociais/populacao/38734-cadastro-nacional-de-enderecos-para-fins-estatisticos.html)
by the Brazilian official statistics and geography office, IBGE, containing the
addresses of more than 100 million households and establishments in Brazil.

## Basic usage

Before using *geocodebr*, make sure it is installed on your computer. You can
install either the latest stable version from CRAN...

```{r, eval = FALSE}
install.packages("geocodebr")
```

... or the development version from GitHub.

```{r, eval = FALSE}
# install.packages("pak")
pak::pak("ipeaGIT/geocodebr")
```

Then attach it to the current R session:

```{r}
library(geocodebr)
```

The main entry point to the package's functionality is `geocode()`, which
takes a data frame of addresses as input and returns the same data frame with
the latitude and longitude of each matched address, as well as two columns
indicating the precision level of the matches. To demonstrate its usage, the
package ships with a few sample data sets. In the example below, we use a small
data set that contains addresses with commonly seen issues, such as missing
information and mistyped fields.

**Note:** Running the function for the first time may take a while, since
*geocodebr* needs to download the CNEFE data, which totals about 5.5 GB.
Alternatively, you can call `download_cnefe()` to download the data before
geocoding (otherwise `geocode()` does this behind the scenes).

```{r}
# Read the sample addresses shipped with the package
input_addresses <- read.csv(
  system.file("extdata/small_sample.csv", package = "geocodebr")
)

# Map each address component to the corresponding column of the input
result <- geocodebr::geocode(
  input_addresses,
  address_fields = geocodebr::setup_address_fields(
    logradouro = "nm_logradouro",
    numero = "Numero",
    cep = "Cep",
    bairro = "Bairro",
    municipio = "nm_municipio",
    estado = "nm_uf"
  ),
  progress = FALSE
)

head(result)
```
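
Because the address-field mapping refers to input columns by name, a typo in a column name is an easy mistake to make. The sketch below is a hypothetical pre-flight check in base R, not a function of {geocodebr}: it reuses the column names from the example above, and `field_map`, `toy_input` and `missing_cols` are illustrative names with made-up values.

```r
# Hypothetical pre-flight check, not part of geocodebr: confirm that every
# column named in the address-field mapping exists in the input data frame.
field_map <- c(
  logradouro = "nm_logradouro",
  numero     = "Numero",
  cep        = "Cep",
  bairro     = "Bairro",
  municipio  = "nm_municipio",
  estado     = "nm_uf"
)

# A one-row stand-in for the real input; the values are made up.
toy_input <- data.frame(
  nm_logradouro = "Rua Exemplo",
  Numero        = 100,
  Cep           = "70000-000",
  Bairro        = "Centro",
  nm_municipio  = "Brasilia",
  nm_uf         = "DF"
)

# Columns referenced by the mapping but absent from the input.
missing_cols <- setdiff(unname(field_map), names(toy_input))
if (length(missing_cols) > 0) {
  stop("Columns not found in input: ", paste(missing_cols, collapse = ", "))
}
```

With real data, `toy_input` would be replaced by the data frame passed to `geocode()`.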

**Note:** The first time this function runs, {geocodebr} downloads a few files and stores them locally, so the data only needs to be downloaded once. See the data cache section below for details.

The output coordinates use Brazil's official geodetic reference system, SIRGAS 2000 (EPSG:4674). The accuracy of each result is indicated in two columns of the output, `precision` and `match_type`, which reflect how exactly each input address was matched with CNEFE data. Both columns are described in the sections below.

## Precision categories

The results of {geocodebr} are classified into six broad `precision` categories, plus `NA` for addresses that were not found:

- "numero"
- "numero_interpolado"
- "rua"
- "cep"
- "bairro"
- "municipio"
- `NA` (not found)

Each precision level can be disaggregated into more refined match types.
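
A quick way to judge a geocoding run is to count how many addresses fall into each category. The sketch below does this in base R on a mock data frame. `mock_result` and its values are invented for illustration, and only the `precision` column name and its categories come from the text above.

```r
# Mock output for illustration only; a real result has more columns.
mock_result <- data.frame(
  id = 1:6,
  precision = c("numero", "rua", "numero", NA, "municipio", "cep"),
  stringsAsFactors = FALSE
)

# Keep the categories in the order they are listed above.
precision_levels <- c(
  "numero", "numero_interpolado", "rua", "cep", "bairro", "municipio"
)
mock_result$precision <- factor(mock_result$precision, levels = precision_levels)

# Count matches per category, keeping unmatched (NA) addresses visible.
precision_counts <- table(mock_result$precision, useNA = "ifany")
print(precision_counts)

# Share of addresses matched at street level or finer.
street_or_better <- mean(
  mock_result$precision %in% c("numero", "numero_interpolado", "rua")
)
```

Using a factor with explicit levels keeps empty categories (here `"numero_interpolado"` and `"bairro"`) in the table with a count of zero.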

### Match type

The `match_type` column provides more refined information on how exactly each input address was matched with CNEFE. In every category, {geocodebr} takes the average latitude and longitude of the CNEFE addresses that match the input address on a given combination of fields. In the strictest case, for example, the function finds a deterministic match on all of the fields of a given address (`"estado"`, `"municipio"`, `"logradouro"`, `"numero"`, `"cep"`, `"localidade"`). Consider a building with several apartments that share the same street address and number. In such a case, the coordinates of the apartments differ very slightly, and {geocodebr} takes the average of those coordinates. In a less strict case, in which only the fields (`"estado"`, `"municipio"`, `"logradouro"`, `"localidade"`) are matched, {geocodebr} calculates the average coordinates of all CNEFE addresses along that street that fall within the same neighborhood.
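
The averaging idea can be illustrated with a few lines of base R. This is only a sketch of the concept under simplifying assumptions, not the package's internal implementation: `toy_cnefe`, `toy_input` and `average_match()` are hypothetical, the coordinates are invented, and only three fields are used to keep the example short.

```r
# Made-up stand-in for a few CNEFE records; all values are illustrative.
toy_cnefe <- data.frame(
  municipio  = c("A", "A", "A", "A"),
  logradouro = c("RUA X", "RUA X", "RUA X", "RUA Y"),
  numero     = c(10, 10, 20, 10),
  lat        = c(-15.80, -15.82, -15.90, -16.00),
  lon        = c(-47.90, -47.92, -47.80, -48.00)
)

toy_input <- data.frame(
  municipio  = "A",
  logradouro = "RUA X",
  numero     = 10
)

# Hypothetical helper: average the coordinates of all reference records
# that share the given key fields with the input address.
average_match <- function(input, reference, fields) {
  candidates <- merge(input, reference, by = fields)
  aggregate(
    cbind(lat, lon) ~ .,
    data = candidates[c(fields, "lat", "lon")],
    FUN = mean
  )
}

# Strict match: same street and number (two records are averaged).
strict <- average_match(toy_input, toy_cnefe, c("municipio", "logradouro", "numero"))

# Looser match: same street only (three records are averaged).
street <- average_match(toy_input, toy_cnefe, c("municipio", "logradouro"))
```

The looser the set of fields, the more reference records contribute to the average, which is why coarser match types yield less precise coordinates.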

The complete list of precision levels, their corresponding match types and the fields considered in each match type is given below:

- precision: **"numero"**
  - match_type:
    - en01: logradouro, numero, cep and bairro
    - en02: logradouro, numero and cep
    - en03: logradouro, numero and bairro
    - en04: logradouro and numero
    - pn01: logradouro, numero, cep and bairro
    - pn02: logradouro, numero and cep
    - pn03: logradouro, numero and bairro
    - pn04: logradouro and numero
- precision: **"numero_interpolado"**
  - match_type:
    - ei01: logradouro, numero, cep and bairro
    - ei02: logradouro, numero and cep
    - ei03: logradouro, numero and bairro
    - ei04: logradouro and numero
    - pi01: logradouro, numero, cep and bairro
    - pi02: logradouro, numero and cep
    - pi03: logradouro, numero and bairro
    - pi04: logradouro and numero
- precision: **"rua"** (when the input number is missing, 'S/N')
  - match_type:
    - er01: logradouro, cep and bairro
    - er02: logradouro and cep
    - er03: logradouro and bairro
    - er04: logradouro
    - pr01: logradouro, cep and bairro
    - pr02: logradouro and cep
    - pr03: logradouro and bairro
    - pr04: logradouro
- precision: **"cep"**
  - match_type:
    - ec01: municipio, cep, localidade
    - ec02: municipio, cep
- precision: **"bairro"**
  - match_type:
    - eb01: municipio, localidade
- precision: **"municipio"**
  - match_type:
    - em01: municipio

***Note:*** Match types starting with 'p' use probabilistic matching of the logradouro field, while types starting with 'e' use deterministic matching only. **Match types with probabilistic matching are not implemented in {geocodebr} yet**.
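
The codes in the list above follow a visible pattern: the first letter indicates the matching method and the second letter the precision level. The sketch below turns that pattern into a small lookup. `decode_match_type()` is a hypothetical helper, not a {geocodebr} function, and the lookup tables are read off the list above rather than taken from any documented API.

```r
# Lookup tables derived from the match type list above.
method_by_letter <- c(e = "deterministic", p = "probabilistic")
precision_by_letter <- c(
  n = "numero", i = "numero_interpolado", r = "rua",
  c = "cep", b = "bairro", m = "municipio"
)

# Hypothetical helper, not part of geocodebr: split a match type code
# into its matching method and precision level.
decode_match_type <- function(match_type) {
  data.frame(
    match_type = match_type,
    method     = unname(method_by_letter[substr(match_type, 1, 1)]),
    precision  = unname(precision_by_letter[substr(match_type, 2, 2)]),
    stringsAsFactors = FALSE
  )
}

# Unmatched addresses (NA) decode to NA in both columns.
decoded <- decode_match_type(c("en01", "pi03", "er04", "ec02", "em01", NA))
print(decoded)
```

A table like `decoded` can be joined back to a result by `match_type`, for instance to filter out coarse matches before analysis.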

## Data cache

The first time the user runs the `geocode()` function, {geocodebr} downloads a few reference files and stores them locally. This way, the data only needs to be downloaded once. Note that these files require approximately 4 GB of space on your local drive.

The package includes the following functions to help users manage cached files:

- `get_cache_dir()`: returns the path to the directory where the cached data is stored. By default, files are cached in the package directory.
- `set_cache_dir()`: sets a custom cache directory. This configuration persists across R sessions.
- `list_cached_data()`: lists all files currently cached.
- `clean_cache_dir()`: deletes all files in the cache directory used by {geocodebr}.
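
As a minimal sketch, the two read-only functions can be used to inspect the cache without changing anything. The snippet assumes that the installed version of {geocodebr} exports the function names listed above and that both can be called without arguments. The `requireNamespace()` guard and the `has_geocodebr` variable are additions for illustration.

```r
# Read-only look at the cache; the calls run only if geocodebr is installed.
has_geocodebr <- requireNamespace("geocodebr", quietly = TRUE)

if (has_geocodebr) {
  # Where cached files currently live.
  print(geocodebr::get_cache_dir())

  # Which files have been downloaded so far.
  print(geocodebr::list_cached_data())
}
```

`set_cache_dir()` and `clean_cache_dir()` are left out on purpose, since the first changes a persistent setting and the second deletes files.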