Geocoding refers to the act of finding a point in space, usually represented by a pair of coordinates, given a street address. The geocodebr package allows one to efficiently geocode Brazilian addresses using the National Registry of Addresses for Statistical Purposes (Cadastro Nacional de Endereços para Fins Estatísticos, CNEFE)[^1], a data set collected and published by the Brazilian official statistics and geography office, IBGE, containing the addresses of more than 100 million households and establishments in Brazil.
Basic usage

Before using geocodebr, please make sure you have it installed on your computer. You can download either the most stable version from CRAN…

install.packages("geocodebr")

… or the development version from GitHub.

# install.packages("pak")
pak::pak("ipeaGIT/geocodebr")

Then attach it to the current R session:

library(geocodebr)

The main entry point to the package’s functionality is geocode(), which takes a data frame of addresses as input and outputs the same data frame with the latitude and longitude of each matched address, as well as two columns indicating the precision level of the matches. To demonstrate its usage, the package includes a few sample data sets in the installation. In the example below, we use a small data set that contains addresses with commonly seen issues, such as missing information and mistyped fields.
Note: Running the function for the first time may take a while, since geocodebr needs to download the CNEFE data, which adds up to about 5.5 GB. Alternatively, you can call download_cnefe() to download the data before geocoding (otherwise geocode() does that behind the scenes).

# Read the sample data set shipped with the package
input_addresses <- read.csv(
  system.file("extdata/small_sample.csv", package = "geocodebr")
)

# Map each address component to the corresponding input column, then geocode
result <- geocodebr::geocode(
  input_addresses,
  address_fields = geocodebr::setup_address_fields(
    logradouro = "nm_logradouro",
    numero = "Numero",
    cep = "Cep",
    bairro = "Bairro",
    municipio = "nm_municipio",
    estado = "nm_uf"
  ),
  progress = FALSE
)
#> Warning: The input of the field 'number' has observations with non numeric characters.
#> These observations were transformed to NA.
#> Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion

head(result)
#>   id            nm_logradouro Numero       Cep               Bairro
#> 1  1 RUA MARIA LUCIA PACIFICO     17 26042-730           SANTA RITA
#> 2  2      RUA LEOPOLDINA TOME     46 25030-050           CENTENARIO
#> 3  3          RUA DONA JUDITE      0 23915-700          CAPUTERA II
#> 4  4     RUA ALEXANDRE AMARAL      0 23098-120           SANTISSIMO
#> 5  5                AVENIDA E    300 23860-000         PRAIA GRANDE
#> 6  6      RUA PRINCESA ISABEL    263 69921-026 ESTACAO EXPERIMENTAL
#>      nm_municipio code_muni          nm_uf       lon        lat match_type
#> 1     NOVA IGUACU   3303500 RIO DE JANEIRO -43.47118 -22.695496       en01
#> 2 DUQUE DE CAXIAS   3301702 RIO DE JANEIRO -43.31134 -22.779173       en01
#> 3  ANGRA DOS REIS   3300100 RIO DE JANEIRO -44.20841 -22.978631       ei01
#> 4  RIO DE JANEIRO   3304557 RIO DE JANEIRO -43.51047 -22.870022       ei01
#> 5     MANGARATIBA   3302601 RIO DE JANEIRO -43.97214 -22.929864       en01
#> 6      RIO BRANCO   1200401           ACRE -67.83559  -9.963436       en01
#>            precision
#> 1             numero
#> 2             numero
#> 3 numero_interpolado
#> 4 numero_interpolado
#> 5             numero
#> 6             numero

Note that the first time the user runs this function, {geocodebr} will download a few files and store them locally. This way, the data only needs to be downloaded once. More information about data caching is given below.

The output coordinates use the official geodetic reference system used in Brazil: SIRGAS 2000 (EPSG:4674). The results of {geocodebr} are classified into six broad precision categories depending on how exactly each input address was matched with CNEFE data. The accuracy of the results is indicated in two columns of the output: precision and match_type. More information is given below.
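
Since the coordinates are expressed in SIRGAS 2000, that reference system can be declared when the output is turned into a spatial object. The following is only a sketch under assumptions: it relies on the sf package, which this text does not mention, and it uses a small hand-built data frame (`result_sample`, a hypothetical name) holding three rows copied from the example output instead of a real geocode() call. Rows without coordinates would need to be dropped beforehand, because sf::st_as_sf() rejects missing coordinates by default.

```r
# Three rows copied from the example output above; in practice this would be
# the data frame returned by geocodebr::geocode().
result_sample <- data.frame(
  id  = c(1, 2, 6),
  lon = c(-43.47118, -43.31134, -67.83559),
  lat = c(-22.695496, -22.779173, -9.963436)
)

# sf is an assumption of this sketch, so only use it when it is installed
if (requireNamespace("sf", quietly = TRUE)) {
  # Declare the coordinates as SIRGAS 2000 (EPSG:4674)
  points_sirgas <- sf::st_as_sf(
    result_sample,
    coords = c("lon", "lat"),
    crs = 4674
  )

  # Reproject to WGS 84 (EPSG:4326) only if another tool requires it
  points_wgs84 <- sf::st_transform(points_sirgas, 4326)
}
```
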
Precision categories

The results of {geocodebr} are classified into six broad precision categories:

- “numero”
- “numero_interpolado”
- “rua”
- “cep”
- “bairro”
- “municipio”
- NA (not found)

Each precision level can be disaggregated into more refined match types.
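
One way to get a quick overview of geocoding quality is to tabulate the precision column. The sketch below uses base R only; the `precision` vector is copied from the six rows of the example output, with one NA appended purely for illustration (a hypothetical unmatched address), and the level order simply follows the order in which the categories are listed above.

```r
# Precision values copied from the example output, plus one hypothetical NA
# standing in for an address that was not found.
precision <- c("numero", "numero", "numero_interpolado",
               "numero_interpolado", "numero", "numero", NA)

# Keep the categories in the order in which they are listed above
precision_levels <- c("numero", "numero_interpolado", "rua",
                      "cep", "bairro", "municipio")
precision_factor <- factor(precision, levels = precision_levels)

# Count matches per category, keeping unmatched addresses (NA) visible
precision_counts <- table(precision_factor, useNA = "ifany")
print(precision_counts)

# Share of addresses matched at the street-number level (exact or interpolated)
share_number_level <- mean(precision %in% c("numero", "numero_interpolado"))
```

With real output, `precision` would be replaced by `result$precision`.
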
Match Type

The column match_type provides more refined information on how exactly each input address was matched with CNEFE. In every category, {geocodebr} takes the average latitude and longitude of the addresses included in CNEFE that match the input address based on combinations of different fields. In the strictest case, for example, the function finds a deterministic match for all of the fields of a given address ("estado", "municipio", "logradouro", "numero", "cep", "localidade"). Think, for example, of a building with several apartments that match the same street address and number. In such a case, the coordinates of the apartments will differ very slightly, and {geocodebr} takes the average of those coordinates. In a less rigorous example, in which only the fields "estado", "municipio", "logradouro" and "localidade" are matched, {geocodebr} calculates the average coordinates of all the addresses in CNEFE that lie along that street and fall within the same neighborhood.
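
The averaging idea described above can be illustrated with a toy example. This is not the package's internal implementation, only a base R sketch of the concept: the `cnefe_toy` data frame and all of its street names and coordinates are hypothetical, and the matching is reduced to two fields for brevity.

```r
# Hypothetical CNEFE-like entries: three apartments in the same building
# (same street and number) and one entry further along the same street.
cnefe_toy <- data.frame(
  logradouro = c("RUA A", "RUA A", "RUA A", "RUA A"),
  numero     = c(100, 100, 100, 250),
  lon        = c(-43.2000, -43.2002, -43.2004, -43.2100),
  lat        = c(-22.9000, -22.9001, -22.9002, -22.9050)
)

# Stricter match (street and number): one averaged point per building
by_number <- aggregate(
  cbind(lon, lat) ~ logradouro + numero,
  data = cnefe_toy,
  FUN = mean
)

# Looser match (street only): one averaged point for the whole street
by_street <- aggregate(
  cbind(lon, lat) ~ logradouro,
  data = cnefe_toy,
  FUN = mean
)

print(by_number)
print(by_street)
```

The stricter match returns a point at the building itself, while the looser match returns a point pulled toward wherever most of the street's addresses are located, which is why the looser categories are reported with lower precision.
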
The complete list of precision levels, their corresponding match type categories and the fields considered in each category is given below:

- precision: “numero”
  - match_type:
    - en01: logradouro, numero, cep and bairro
    - en02: logradouro, numero and cep
    - en03: logradouro, numero and bairro
    - en04: logradouro and numero
    - pn01: logradouro, numero, cep and bairro
    - pn02: logradouro, numero and cep
    - pn03: logradouro, numero and bairro
    - pn04: logradouro and numero
- precision: “numero_interpolado”
  - match_type:
    - ei01: logradouro, numero, cep and bairro
    - ei02: logradouro, numero and cep
    - ei03: logradouro, numero and bairro
    - ei04: logradouro and numero
    - pi01: logradouro, numero, cep and bairro
    - pi02: logradouro, numero and cep
    - pi03: logradouro, numero and bairro
    - pi04: logradouro and numero
- precision: “rua” (when the input number is missing, ‘S/N’)
  - match_type:
    - er01: logradouro, cep and bairro
    - er02: logradouro and cep
    - er03: logradouro and bairro
    - er04: logradouro
    - pr01: logradouro, cep and bairro
    - pr02: logradouro and cep
    - pr03: logradouro and bairro
    - pr04: logradouro
- precision: “cep”
  - match_type:
    - ec01: municipio, cep, localidade
    - ec02: municipio, cep
- precision: “bairro”
  - match_type:
    - eb01: municipio, localidade
- precision: “municipio”
  - match_type:
    - em01: municipio

Note: Match types starting with ‘p’ use probabilistic matching of the logradouro field, while types starting with ‘e’ use deterministic matching only. Match types with probabilistic matching are not yet implemented in {geocodebr}.
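
The match type codes follow a regular pattern that can be decoded programmatically. The helper below, `describe_match_type()`, is hypothetical and not part of {geocodebr}: the meaning of the first letter comes from the note above, while the meaning of the second letter is inferred from the list of categories, so treat it as a sketch rather than a documented convention.

```r
# Hypothetical helper (not part of geocodebr): decode match_type codes.
# First letter: matching method. Second letter: precision category,
# as inferred from the list of match types.
describe_match_type <- function(match_type) {
  method <- c(e = "deterministic", p = "probabilistic")
  precision <- c(
    n = "numero", i = "numero_interpolado", r = "rua",
    c = "cep", b = "bairro", m = "municipio"
  )
  data.frame(
    match_type = match_type,
    method     = unname(method[substr(match_type, 1, 1)]),
    precision  = unname(precision[substr(match_type, 2, 2)]),
    stringsAsFactors = FALSE
  )
}

decoded <- describe_match_type(c("en01", "ei01", "pr04", "em01"))
print(decoded)
```

Codes that do not follow the pattern, including NA for addresses that were not found, come back with NA in the decoded columns.
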

Data cache

The first time the user runs the geocode() function, {geocodebr} will download a few reference files and store them locally. This way, the data only needs to be downloaded once. Note that these files require approximately 4 GB of space on your local drive.

The package includes the following functions to help users manage cached files:

- get_cache_dir(): returns the path to where the cached data is stored. By default, files are cached in the package directory.
- set_cache_dir(): sets a custom directory to be used. This configuration is persistent across different R sessions.
- list_cached_data(): lists all files currently cached.
- clean_cache_dir(): deletes all files in the cache directory used by {geocodebr}.