add read support for dta formats 120 and 121. closes #85 #86

Merged
merged 4 commits into from
Aug 9, 2024
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -23,4 +23,4 @@ LinkingTo: Rcpp
ByteCompile: yes
Suggests: testthat
Encoding: UTF-8
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.2
14 changes: 11 additions & 3 deletions R/read.R
@@ -1,5 +1,5 @@
#
-# Copyright (C) 2014-2021 Jan Marvin Garbuszus and Sebastian Jeworutzki
+# Copyright (C) 2014-2024 Jan Marvin Garbuszus and Sebastian Jeworutzki
# Copyright (C) of 'convert.dates' and 'missing.types' Thomas Lumley
#
# This program is free software; you can redistribute it and/or modify it
@@ -29,7 +29,7 @@
#' "label_(integer code)".
#' @param encoding \emph{character.} Strings can be converted from Windows-1252
#' or UTF-8 to system encoding. Options are "latin1" or "UTF-8" to specify
-#' target encoding explicitly. Stata 14, 15 and 16 files are UTF-8 encoded and
+#' target encoding explicitly. Since Stata 14 files are UTF-8 encoded and
#' may contain strings which can't be displayed in the current locale.
#' Set encoding=NULL to stop reencoding.
#' @param fromEncoding \emph{character.} We expect strings to be encoded as
@@ -93,6 +93,13 @@
#'
#' Reading dta-files of older and newer versions than 13 was introduced
#' with version 0.8.
+#'
+#' Stata 18 introduced alias variables and frame files. Alias variables are
+#' currently ignored when reading the file and a warning is printed. Stata
+#' frame files (file extension `.dtas`) contain zipped `dta` files which can
+#' be loaded individually. The read test provides an example how to construct
+#' the alias variables from a Stata frame file.
+#'
#' @return The function returns a data.frame with attributes. The attributes
#' include
#' \describe{
@@ -127,7 +134,7 @@
#' \dontrun{
#' library(readstata13)
#' r13 <- read.dta13("https://www.stata-press.com/data/r13/auto.dta")
-#' }
+#' }
#' @author Jan Marvin Garbuszus \email{jan.garbuszus@@ruhr-uni-bochum.de}
#' @author Sebastian Jeworutzki \email{sebastian.jeworutzki@@ruhr-uni-bochum.de}
#' @useDynLib readstata13, .registration = TRUE
@@ -212,6 +219,7 @@ read.dta13 <- function(file, convert.factors = TRUE, generate.factors=FALSE,

sstr <- 2045
sstrl <- 32768
+salias <- 65525
sdouble <- 65526
sfloat <- 65527
slong <- 65528
3 changes: 1 addition & 2 deletions R/readstata13.R
@@ -8,10 +8,9 @@
#'
#' @name readstata13
#' @aliases readstata13-package
-#' @docType package
#' @useDynLib readstata13, .registration = TRUE
#' @import Rcpp
#' @note If you catch a bug, please do not sue us, we do not have any money.
#' @seealso \code{\link[foreign]{read.dta}} and \code{memisc} for dta files from
#' Stata Versions < 13
-NULL
+"_PACKAGE"
13 changes: 5 additions & 8 deletions R/save.R
@@ -33,16 +33,16 @@
#' to Stata date time format. Code from \code{foreign::write.dta}
#' @param convert.underscore \emph{logical.} If \code{TRUE}, all non numerics or
#' non alphabet characters will be converted to underscores.
-#' @param tz \emph{character.} time zone specification to be used for
-#' POSIXct values and dates (if convert.dates is TRUE). ‘""’ is the current
+#' @param tz \emph{character.} time zone specification to be used for
+#' POSIXct values and dates (if convert.dates is TRUE). ‘""’ is the current
#' time zone, and ‘"GMT"’ is UTC (Universal Time, Coordinated).
#' @param add.rownames \emph{logical.} If \code{TRUE}, a new variable rownames
#' will be added to the dta-file.
#' @param compress \emph{logical.} If \code{TRUE}, the resulting dta-file will
#' use all of Statas numeric-vartypes.
#' @param version \emph{numeric.} Stata format for the resulting dta-file either
#' Stata version number (6 - 16) or the internal Stata dta-format (e.g. 117 for
-#' Stata 13). Experimental support for large datasets: Use version="15mp" to
+#' Stata 13). Support for large datasets: Use version="15mp" to
#' save the dataset in the new Stata 15/16 MP file format. This feature is not
#' thoroughly tested yet.
#' @return The function writes a dta-file to disk. The following features of the
@@ -68,7 +68,7 @@
#' \dontrun{
#' library(readstata13)
#' save.dta13(cars, file="cars.dta")
-#' }
+#' }
#' @author Jan Marvin Garbuszus \email{jan.garbuszus@@ruhr-uni-bochum.de}
#' @author Sebastian Jeworutzki \email{sebastian.jeworutzki@@ruhr-uni-bochum.de}
#' @useDynLib readstata13, .registration = TRUE
@@ -104,10 +104,7 @@ save.dta13 <- function(data, file, data.label=NULL, time.stamp=TRUE,
if (version==6)
version <- 108

-if (version == 119)
-message("Support for Stata 15/16 MP (119) format is experimental and not thoroughly tested.")
-
-if (version<102 | version == 109 | version == 116 | version>119)
+if (version<102 | version == 109 | version == 116 | version>121)
stop("Version mismatch abort execution. No Data was saved.")

sstr <- 2045
Binary file added inst/extdata/myproject2.dtas
Binary file not shown.
4 changes: 3 additions & 1 deletion inst/include/readstata.h
@@ -1,5 +1,5 @@
/*
-* Copyright (C) 2015-2017 Jan Marvin Garbuszus and Sebastian Jeworutzki
+* Copyright (C) 2015-2024 Jan Marvin Garbuszus and Sebastian Jeworutzki
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
@@ -155,6 +155,8 @@ inline Rcpp::IntegerVector calc_rowlength(Rcpp::IntegerVector vartype) {
case STATA_STRL:
rlen(i) = 8;
break;
+case STATA_ALIAS: // 0
+break;
default:
rlen(i) = type;
break;
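
A note on the `calc_rowlength` change above: the new `STATA_ALIAS` case leaves the row length untouched, which suggests that an alias column contributes no bytes to a data row. The following is a minimal, self-contained sketch of that idea, not the package's code. `row_length` and `check_row_length` are hypothetical names, `std::vector` stands in for `Rcpp::IntegerVector`, and the 8-byte and 4-byte widths for double and float are the usual sizes rather than values shown in this diff. The type codes and the 8-byte strL width are taken from the diff; the remaining numeric type codes are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Type codes as they appear in this PR (statadefines.h, read.R).
constexpr std::uint32_t kStrMax = 2045;   // longest fixed-length string
constexpr std::uint32_t kStrL   = 32768;
constexpr std::uint32_t kAlias  = 65525;
constexpr std::uint32_t kDouble = 65526;
constexpr std::uint32_t kFloat  = 65527;

// Hypothetical helper: bytes that one observation occupies in the data
// section, given the variable type codes of all columns.
std::uint64_t row_length(const std::vector<std::uint32_t>& vartypes) {
    std::uint64_t total = 0;
    for (std::uint32_t type : vartypes) {
        switch (type) {
            case kDouble: total += 8; break;  // assumed width of a double
            case kFloat:  total += 4; break;  // assumed width of a float
            case kStrL:   total += 8; break;  // width used in the diff
            case kAlias:  break;              // alias: nothing stored per row
            default:
                // Fixed-length strings: the type code is the byte length.
                // Other numeric codes are omitted from this sketch.
                if (type <= kStrMax) total += type;
                break;
        }
    }
    return total;
}

// Small self-check that can be called from any main().
void check_row_length() {
    assert(row_length({kDouble, kAlias, 10, kStrL}) == 26);
    assert(row_length({kAlias}) == 0);
}
```

Giving the alias code its own empty case keeps it out of the `default` branch, where the code 65525 would otherwise be treated as a string length.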
3 changes: 2 additions & 1 deletion inst/include/statadefines.h
@@ -1,5 +1,5 @@
/*
-* Copyright (C) 2015 Jan Marvin Garbuszus and Sebastian Jeworutzki
+* Copyright (C) 2015-2023 Jan Marvin Garbuszus and Sebastian Jeworutzki
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
@@ -53,6 +53,7 @@
#define STATA_INT 65528
#define STATA_FLOAT 65527
#define STATA_DOUBLE 65526
+#define STATA_ALIAS 65525

#define STATA_STR 2045
#define STATA_SHORT_STR 244
10 changes: 8 additions & 2 deletions man/read.dta13.Rd

Some generated files are not rendered by default.

8 changes: 4 additions & 4 deletions man/save.dta13.Rd

Some generated files are not rendered by default.

15 changes: 14 additions & 1 deletion src/read_data.cpp
@@ -1,5 +1,5 @@
/*
-* Copyright (C) 2014-2018 Jan Marvin Garbuszus and Sebastian Jeworutzki
+* Copyright (C) 2014-2024 Jan Marvin Garbuszus and Sebastian Jeworutzki
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
@@ -46,6 +46,12 @@ List read_data(FILE * file,
SET_VECTOR_ELT(df, i, IntegerVector(no_init(nn)));
break;

+// return correct column size and create a warning
+case STATA_ALIAS:
+SET_VECTOR_ELT(df, i, CharacterVector(no_init(nn)));
+Rf_warning("File contains unhandled alias variable in column: %d", i + 1);
+break;
+
default:
SET_VECTOR_ELT(df, i, CharacterVector(no_init(nn)));
break;
@@ -166,6 +172,7 @@ List read_data(FILE * file,
break;
}
case 118:
+case 120:
{
int16_t v = 0;
int64_t o = 0, z = 0;
@@ -193,6 +200,7 @@ List read_data(FILE * file,
break;
}
case 119:
+case 121:
{
int32_t v = 0;
int64_t o = 0, z = 0;
@@ -221,8 +229,13 @@ List read_data(FILE * file,
}
}
break;
}
+case STATA_ALIAS:
+{
+break; // do nothing
+}
// case < 0:
+// case STATA_ALIAS
default:
{
// skip to the next valid case
21 changes: 13 additions & 8 deletions src/read_dta.cpp
@@ -1,5 +1,5 @@
/*
-* Copyright (C) 2014-2023 Jan Marvin Garbuszus and Sebastian Jeworutzki
+* Copyright (C) 2014-2024 Jan Marvin Garbuszus and Sebastian Jeworutzki
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
@@ -38,7 +38,7 @@ List read_dta(FILE * file,
*/

int8_t fversion = 117L; //f = first
-int8_t lversion = 119L; //l = last
+int8_t lversion = 121L; //l = last

std::string version(3, '\0');
readstring(version, file, version.size());
@@ -74,6 +74,8 @@ List read_dta(FILE * file,
break;
case 118:
case 119:
+case 120:
+case 121:
nvarnameslen = 129;
nformatslen = 57;
nvalLabelslen = 129;
@@ -106,9 +108,9 @@ List read_dta(FILE * file,
*/

uint32_t k = 0;
-if (release < 119)
+if (release < 119 || release == 120)
k = readbin((uint16_t)k, file, swapit);
-if (release == 119)
+if (release == 119 || release == 121)
k = readbin(k, file, swapit);

//</K>
@@ -123,7 +125,7 @@ List read_dta(FILE * file,

if (release == 117)
n = readbin((uint32_t)n, file, swapit);
-if ((release == 118) | (release == 119))
+if ((release >= 118) && (release <= 121))
n = readbin(n, file, swapit);

//</N>
@@ -146,7 +148,7 @@ List read_dta(FILE * file,

if (release == 117)
ndlabel = readbin((int8_t)ndlabel, file, swapit);
-if ((release == 118) | (release == 119))
+if ((release >= 118) && (release <= 121))
ndlabel = readbin(ndlabel, file, swapit);

std::string datalabel(ndlabel, '\0');
@@ -224,6 +226,7 @@ List read_dta(FILE * file,
* vartypes.
* 0-2045: strf (String: Max length 2045)
* 32768: strL (long String: Max length 2 billion)
+* 65525: alias
* 65526: double
* 65527: float
* 65528: long
@@ -274,9 +277,9 @@ List read_dta(FILE * file,
{
uint32_t nsortlist = 0;

-if ((release == 117) | (release == 118))
+if ((release == 117) || (release == 118) || (release == 120))
nsortlist = readbin((uint16_t)nsortlist, file, swapit);
-if (release == 119)
+if (release == 119 || release == 121)
nsortlist = readbin(nsortlist, file, swapit);

sortlist[i] = nsortlist;
@@ -530,6 +533,8 @@ List read_dta(FILE * file,
}
case 118:
case 119:
+case 120:
+case 121:
{
uint32_t v = 0;
uint64_t o = 0;
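
A note on the release checks in `src/read_dta.cpp`: the variable count `<K>` and each sort-list entry are read as `uint16_t` for releases 117, 118 and 120, and as `uint32_t` for releases 119 and 121. The sketch below restates that mapping in one place. `k_field_bytes` is a hypothetical helper that does not exist in readstata13, and returning 0 for any other release is a choice made for this sketch, not behaviour taken from the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of readstata13: width in bytes of the
// variable-count field <K> (and of each sort-list entry) for a dta release.
// Releases outside 117-121 return 0 in this sketch.
constexpr std::size_t k_field_bytes(int release) {
    switch (release) {
        case 117:
        case 118:
        case 120:
            return 2;  // read as uint16_t in the diff
        case 119:
        case 121:
            return 4;  // read as uint32_t in the diff
        default:
            return 0;
    }
}

// Compile-time checks: 120 pairs with 118, and 121 pairs with 119.
static_assert(k_field_bytes(120) == k_field_bytes(118), "120 uses the 118 width");
static_assert(k_field_bytes(121) == k_field_bytes(119), "121 uses the 119 width");
```

A single lookup like this would keep the width rule in one spot instead of repeating the release comparisons at each read, at the cost of one more helper to maintain.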