---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit the latter -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
<!-- badges: start -->
[![Codecov test coverage](https://codecov.io/gh/russHyde/dupree/branch/main/graph/badge.svg)](https://codecov.io/gh/russHyde/dupree?branch=main)
[![R-CMD-check](https://github.com/russHyde/dupree/workflows/R-CMD-check/badge.svg)](https://github.com/russHyde/dupree/actions)
<!-- badges: end -->
# dupree
The goal of `dupree` is to identify chunks / blocks of highly duplicated code
within a set of R scripts.
A very lightweight approach is used:
- The user provides a set of `*.R` and/or `*.Rmd` files;
- All R-code in the user-provided files is read and code-blocks are identified;
- The non-trivial symbols from each code-block are retained (for instance,
really common symbols like `<-`, `,`, `+`, `(` are dropped);
- Similarity between different blocks is calculated using `stringdist::seq_sim`
by longest-common-subsequence. Symbol identity is assessed at the whole-word
level (so "my_data", "my_Data", "my.data" and "myData" are not considered to be
identical) and all non-trivial symbols have equal weight in the similarity
calculation;
- Code-block pairs (both between and within the files) are returned in order
of highest similarity.
To prevent the results being dominated by high-identity blocks containing very
few symbols (e.g., `library(dplyr)`), the user can specify a `min_block_size`.
Only code-blocks containing at least this many non-trivial symbols are kept.
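The token filtering and `min_block_size` check described above can be sketched in base R. This is only an illustration under stated assumptions, not `dupree`'s internal implementation: the helper `nontrivial_symbols()`, its `drop` list and the threshold of 5 are all hypothetical.

```r
# Hypothetical helper (not part of the dupree API): return the non-trivial
# tokens of a piece of R code, in source order.
nontrivial_symbols <- function(
    code,
    # Illustrative list of "really common" tokens to ignore (an assumption)
    drop = c("<-", ",", "+", "-", "(", ")", "{", "}", "[", "]",
             "$", "@", ":", "=", "&&")) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # Keep only terminal tokens, and ignore comments
  tokens <- tokens[tokens$terminal & tokens$token != "COMMENT", ]
  tokens$text[!tokens$text %in% drop]
}

small_block <- "library(dplyr)"
larger_block <- "result <- filter(my_data, value > 0) # keep positives"

# A block is analysed only if it has enough non-trivial tokens
min_block_size <- 5 # hypothetical threshold, chosen for this example
length(nontrivial_symbols(small_block)) >= min_block_size
length(nontrivial_symbols(larger_block)) >= min_block_size
```

With this threshold the `library(dplyr)` call is dropped, so it cannot dominate the results, while the longer statement is retained.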
## Installation
You can install `dupree` from GitHub with:
```{r gh-installation, eval = FALSE}
if (!"dupree" %in% rownames(installed.packages())) {
  # Alternatively:
  # install.packages("dupree")
  remotes::install_github("russHyde/dupree")
}
```
## Example
To run `dupree` over a set of R files, you can use the `dupree()`,
`dupree_dir()` or `dupree_package()` functions. For example, to identify
duplication within all of the `.R` and `.Rmd` files for the `dupree` package
you could run the following:
```{r example}
## basic example code
library(dupree)
files <- dir(pattern = "\\.R(md)?$", recursive = TRUE)
dupree(files)
```
Any top-level code block that contains at least
`r formals(dupree)$min_block_size` non-trivial tokens is
included in the above analysis. A token is a function or variable name, an
operator, etc.; comments, white-space and some really common tokens
(`[](){}-+$@:,=`, `<-`, `&&`, etc.) are ignored. To be more restrictive, you
could consider only larger code-blocks (by increasing `min_block_size`) within
just the `./R/` source code directory:
```{r}
# R-source code files in the ./R/ directory of the dupree package:
source_files <- dir(path = "./R", pattern = "\\.R(md)?$", full.names = TRUE)
# analyse any code blocks that contain at least 50 non-trivial tokens
dupree(source_files, min_block_size = 50)
```
For each (sufficiently big) code block in the provided files, `dupree` will
return the code-block that is most-similar to it (although any given block
may be present in the results multiple times if it is the closest match for
several other code blocks).
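To see why one block can appear several times in the results, consider a small sketch of best-match selection. The similarity matrix below is made up for illustration, and picking the row-wise maximum is an assumption about the general idea rather than a description of `dupree`'s internals.

```r
# Hypothetical pairwise similarity scores for three code blocks.
# The diagonal is NA so that a block is never matched with itself.
scores <- matrix(
  c(NA, 0.8, 0.3,
    0.8, NA, 0.4,
    0.3, 0.4, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("a", "b", "c"))
)

# For each block (row), find the name of its most similar partner;
# which.max() skips the NA on the diagonal
best_match <- apply(scores, 1, function(row) names(which.max(row)))
best_match
```

Here block `b` is the closest match for both `a` and `c`, so it would be reported in more than one pair.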
Code-block pairs with a higher `score` value are more similar. `score` lies in
the range [0, 1] and is calculated by the
[`stringdist`](https://github.com/markvanderloo/stringdist) package. Matching
occurs at the token level: the token "my_data" is no more similar to the token
"myData" than it is to "x".
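The token-level behaviour can be illustrated with a small base-R longest-common-subsequence sketch. `dupree` delegates the real calculation to `stringdist`; the helpers `lcs_length()` and `token_similarity()` below are hypothetical, and the normalisation used (twice the LCS length divided by the total number of tokens) is an assumption about how such a score can be scaled to [0, 1].

```r
# Hypothetical helper: length of the longest common subsequence of two
# token vectors, by dynamic programming.
lcs_length <- function(a, b) {
  n <- length(a)
  m <- length(b)
  # dp[i + 1, j + 1] holds the LCS length of a[1:i] and b[1:j]
  dp <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      dp[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i, j + 1L], dp[i + 1L, j])
      }
    }
  }
  dp[n + 1L, m + 1L]
}

# Hypothetical score in [0, 1]: 1 for identical token sequences,
# 0 when no tokens are shared (assumed normalisation)
token_similarity <- function(a, b) {
  total <- length(a) + length(b)
  if (total == 0L) {
    return(1)
  }
  2 * lcs_length(a, b) / total
}

block_1 <- c("read_csv", "my_data", "filter", "my_data", "value", "0")
block_2 <- c("read_csv", "myData", "filter", "myData", "value", "0")
block_3 <- c("read_csv", "x", "filter", "x", "value", "0")

token_similarity(block_1, block_1)
# Tokens are compared as whole words, so these two scores are equal:
token_similarity(block_1, block_2)
token_similarity(block_1, block_3)
```

Because tokens either match exactly or not at all, renaming `my_data` to `myData` lowers the score by exactly as much as renaming it to `x`.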
If you find code-block pairs with a similarity score much greater than 0.5,
there is probably some commonality that could be abstracted away.
----
Note that you can do something similar using the functions `dupree_dir` and
(if you are analysing a package) `dupree_package`.
```{r}
# Analyse all R files in the R/ directory:
dupree_dir(".", filter = "R/")
```
```{r}
# Analyse all R files except those in the tests / presentations directories
# (`dupree_dir` uses grep-like arguments):
dupree_dir(
  ".",
  filter = "tests|presentations", invert = TRUE
)
```
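The effect of grep-like `filter` and `invert` arguments can be previewed with base R's `grep()` on a vector of file paths. The paths below are hypothetical, and this sketch only illustrates the filtering semantics; it assumes nothing about how `dupree_dir` applies the filter internally.

```r
# Hypothetical file paths, as might be found in a package directory
files <- c(
  "R/dupree.R",
  "R/utils.R",
  "tests/testthat/test-dupree.R",
  "presentations/talk.Rmd",
  "README.Rmd"
)

# Keep only the paths that match the pattern
in_r_dir <- grep("R/", files, value = TRUE)
in_r_dir

# With invert = TRUE, keep only the paths that do NOT match the pattern
not_tests <- grep("tests|presentations", files, invert = TRUE, value = TRUE)
not_tests
```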
```{r}
# Analyse all R source code in the package (only looking at the ./R/ directory)
dupree_package(".")
```