Add a naive regex fallback #237

Merged
merged 8 commits into from
Jan 12, 2024
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## [4.4.0](https://github.com/omrilotan/isbot/compare/v4.3.0...v4.4.0)

- Add a naive fallback pattern for engines that do not support lookbehind in regular expressions
- Add isbotNaive function to identify bots using a naive approach (simpler and faster)

## [4.3.0](https://github.com/omrilotan/isbot/compare/v4.2.0...v4.3.0)

- Accept `undefined` in place of user agent string to allow headers property to be used "as is" (`request.headers["user-agent"]`)
10 changes: 9 additions & 1 deletion README.md
@@ -39,10 +39,18 @@ Using JSDeliver CDN you can import an iife script
// isbot is global isbot(navigator.userAgent)
```

## Additional named imports
## How `isbot` maintains accuracy

> `isbot`'s core value is accurate bot identification with a regular expression. It builds that expression from expansive, regularly updated lists of user agent strings, so that it matches bots and only bots.
>
> The full pattern relies on lookbehind assertions, which are not supported in all environments. For environments without lookbehind support, a less accurate fallback pattern is provided. The test suite holds the fallback to an accepted error budget: fewer than 1% of browsers falsely detected as bots (false positives) and more than 75% of known bots detected.
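
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the library's exact code: `supportsLookbehind` and `compileWithFallback` are hypothetical helper names, and only the naive pattern itself is taken from this PR.

```typescript
// Naive pattern introduced in this PR (no lookbehind, so it parses everywhere).
const naivePattern: RegExp = /bot|spider|crawl|http|lighthouse/i;

// Hypothetical helper: feature-detect lookbehind support.
// Building the RegExp from a string defers parsing to run time, so an
// unsupported engine throws here instead of failing to parse the whole file.
function supportsLookbehind(): boolean {
  try {
    new RegExp("(?<!a)b");
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: compile the full pattern source, or fall back
// to the given pattern when the engine rejects the source.
function compileWithFallback(source: string, fallback: RegExp): RegExp {
  try {
    return new RegExp(source, "i");
  } catch {
    return fallback;
  }
}

// "(?<![hg]m)score" is one of the lookbehind parts in src/patterns.json.
const usedPattern = compileWithFallback("(?<![hg]m)score", naivePattern);
console.log(
  usedPattern === naivePattern ? "naive fallback in use" : "full pattern in use",
);
console.log("lookbehind supported:", supportsLookbehind());
```

Compiling from a string rather than a regex literal is the key design choice: a literal containing lookbehind is a syntax error at parse time in older engines, which no `try`/`catch` can intercept.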

## All named imports

| import | Type | Description |
| ------------------- | ------------------------------------------------- | ---------------------------------------------------------------------------- |
| isbot | _(userAgent: string): boolean_ | Check if the user agent is a bot |
| isbotNaive | _(userAgent: string): boolean_ | Check if the user agent is a bot using a naive pattern (less accurate) |
| pattern | _RegExp_ | The regular expression used to identify bots |
| list | _string[]_ | List of all individual pattern parts |
| isbotMatch | _(userAgent: string): string \| null_ | The substring matched by the regular expression |
5 changes: 3 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "isbot",
"version": "4.3.0",
"version": "4.4.0",
"description": "🤖 Recognise bots/crawlers/spiders using the user agent string.",
"keywords": [
"bot",
@@ -44,13 +44,14 @@
"default": "./index.js"
}
},
"sideEffects": false,
"types": "index.d.ts",
"scripts": {
"prepare": "./scripts/prepare/index.js",
"build": "./scripts/build/procedure.sh",
"format": "./scripts/format/procedure.sh",
"pretest": "npm run build && npm run prepare",
"test": "node --expose-gc node_modules/.bin/jest --verbose",
"test": "./scripts/test/procedure.sh",
"prepublishOnly": "./scripts/prepublish/procedure.sh",
"prestart": "which parcel || npm i parcel-bundler --no-save",
"start": "parcel page/index.pug --out-dir docs",
18 changes: 12 additions & 6 deletions scripts/build/pattern.js
@@ -3,10 +3,16 @@
import { writeFile } from "node:fs/promises";
import patterns from "../../src/patterns.json" assert { type: "json" };

const pattern = new RegExp(patterns.join("|"), "i").toString();
const code = `
export const regex: RegExp = ${pattern};
export const parts: number = ${patterns.length};
export const size: number = ${pattern.length};
`.trim();
const pattern = new RegExp(
patterns
.map((pattern) => pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
.join("|"),
).source;

const expression = new RegExp(patterns.join("|"), "i").toString();

const code = [
`export const fullPattern: string = "${pattern}";`,
`export const regularExpression: RegExp = ${expression};`,
].join("\n");
await writeFile("src/pattern.ts", code);
19 changes: 19 additions & 0 deletions scripts/test/procedure.sh
@@ -0,0 +1,19 @@
#!/usr/bin/env bash

failures=0

node --expose-gc node_modules/.bin/jest --verbose "$@"
failures=$((failures + $?))

echo $(which es-check)
if [[ -z $(which es-check) ]]; then
echo "es-check not found. install locally."
npm install es-check --no-save
failures=$((failures + $?))
fi

es-check es2015 index.iife.js
failures=$((failures + $?))

echo -e "→ Number of failures: ${failures}"
exit $failures
29 changes: 25 additions & 4 deletions src/index.ts
@@ -1,23 +1,44 @@
import { regex } from "./pattern";
import { fullPattern, regularExpression } from "./pattern";
import patternsList from "./patterns.json";

/**
* Naive bot pattern.
*/
const naivePattern = /bot|spider|crawl|http|lighthouse/i;

// Workaround for TypeScript's type definition of imported variables and JSON files.

/**
* A pattern that matches bot identifiers in user agent strings.
*/
export const pattern: RegExp = regex;
export const pattern = regularExpression;

/**
* A list of bot identifiers to be used in a regular expression against user agent strings.
*/
export const list: string[] = patternsList;

/**
* Check if the given user agent includes a bot pattern. Naive implementation (less accurate).
*/
export const isbotNaive = (userAgent?: string | null): boolean =>
Boolean(userAgent) && naivePattern.test(userAgent);

let usedPattern: RegExp;
/**
* Check if the given user agent includes a bot pattern.
*/
export const isbot = (userAgent?: string | null): boolean =>
Boolean(userAgent) && pattern.test(userAgent);
export function isbot(userAgent?: string | null): boolean {
if (typeof usedPattern === "undefined") {
try {
// Build this RegExp dynamically to avoid syntax errors in older engines.
usedPattern = new RegExp(fullPattern, "i");
} catch (error) {
usedPattern = naivePattern;
}
}
return Boolean(userAgent) && usedPattern.test(userAgent);
}

/**
* Create a custom isbot function with a custom pattern.
6 changes: 3 additions & 3 deletions src/patterns.json
@@ -11,7 +11,6 @@
"(?<![hg]m)score",
"@[a-z]",
"\\(at\\)[a-z]",
"\\(github\\.com/",
These changes seem out of scope for this fix?

Owner Author
True, this is a side effect of my creating the naive pattern. I examined parts of the existing regular expression and replaced them with more efficient patterns that allow the same coverage.
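
One way to sanity-check a replacement such as `"tracemyfile"` → `"trace"` is to confirm that everything the old parts matched is still matched by the new ones. The sketch below is only an illustration: `lostCoverage` is a hypothetical helper and the sample strings are made up, not taken from the project's fixtures.

```typescript
// Hypothetical helper: returns the samples that the old pattern parts
// matched but the new pattern parts no longer match.
function lostCoverage(
  oldParts: string[],
  newParts: string[],
  samples: string[],
): string[] {
  const oldPattern = new RegExp(oldParts.join("|"), "i");
  const newPattern = new RegExp(newParts.join("|"), "i");
  return samples.filter(
    (sample) => oldPattern.test(sample) && !newPattern.test(sample),
  );
}

// Illustrative samples only (assumed, not real fixture data).
const samples = ["TraceMyFile/1.0", "Mozilla/5.0 (compatible; tracemyfile)"];
const lost = lostCoverage(["tracemyfile"], ["trace"], samples);
console.log(lost.length === 0 ? "coverage preserved" : lost);
```

A shorter part such as `"trace"` can also match strings the longer one did not, so a check like this would need to be paired with a false-positive check against browser user agents.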

"\\[at\\][a-z]",
"^12345",
"^<",
@@ -55,6 +54,7 @@
"^mozilla/\\d\\.\\d \\w*$",
"^navermailapp",
"^netsurf",
"^nuclei",
"^offline explorer",
"^php",
"^postman",
@@ -64,6 +64,7 @@
"^read",
"^reed",
"^rest",
"^serf",
"^snapchat",
"^space bison",
"^svn",
@@ -132,7 +133,6 @@
"offbyone",
"optimize",
"pageburst",
"pagespeed",
"parser",
"perl",
"phantom",
@@ -163,7 +163,7 @@
"synapse",
"synthetic",
"torrent",
"tracemyfile",
"trace",
"transcoder",
"twingly recon",
"url",
46 changes: 46 additions & 0 deletions tests/spec/__snapshots__/test.ts.snap
@@ -0,0 +1,46 @@
// Jest Snapshot v1, https://goo.gl/fbAQLP

exports[`isbot module interface interface is as expected 1`] = `
[
[
"pattern",
"RegExp",
],
[
"list",
"Array",
],
[
"isbotNaive",
"Function",
],
[
"isbot",
"Function",
],
[
"createIsbot",
"Function",
],
[
"createIsbotFromList",
"Function",
],
[
"isbotMatch",
"Function",
],
[
"isbotMatches",
"Function",
],
[
"isbotPattern",
"Function",
],
[
"isbotPatterns",
"Function",
],
]
`;
85 changes: 85 additions & 0 deletions tests/spec/test.ts
@@ -2,20 +2,33 @@ import {
pattern,
list,
isbot,
isbotNaive,
isbotMatch,
isbotMatches,
isbotPattern,
isbotPatterns,
createIsbot,
createIsbotFromList,
} from "../../src";
import { fullPattern, regularExpression } from "../../src/pattern";
import { crawlers, browsers } from "../../fixtures";
let isbotInstance: any;

const BOT_USER_AGENT_EXAMPLE =
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const BROWSER_USER_AGENT_EXAMPLE =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91 Safari/537.36";

const USER_AGENT_COMMON = [
"Ada Chat Bot/1.0 Request Block",
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Safari/537.36 Chrome-Lighthouse",
];
const USER_AGENT_GOTCHAS = [
"Mozilla/5.0 (Linux; Android 10; CUBOT_X30) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.85 Mobile Safari/537.36",
"PS4Application libhttp/1.000 (PS4) CoreMedia libhttp/6.72 (PlayStation 4)",
];

describe("isbot", () => {
describe("features", () => {
test("pattern: pattern is a regex", () => {
@@ -79,6 +92,65 @@ describe("isbot", () => {
);
});

describe("isbotNaive", () => {
test.each([75])(
"a large number of user agent strings can be detected (>%s%)",
(percent) => {
const ratio =
crawlers.filter((ua) => isbotNaive(ua)).length / crawlers.length;
expect(ratio).toBeLessThan(1);
expect(ratio).toBeGreaterThan(percent / 100);
},
);
test.each([1])(
"a small number of browsers is falsely detected as bots (<%s%)",
(percent) => {
const ratio =
browsers.filter((ua) => isbotNaive(ua)).length / browsers.length;
expect(ratio).toBeGreaterThan(0);
expect(ratio).toBeLessThan(percent / 100);
},
);
});

describe("regex fallback", () => {
beforeAll(async () => {
jest
.spyOn(globalThis, "RegExp")
.mockImplementation((pattern, flags): RegExp => {
if ((pattern as string).includes?.("?<!")) {
throw new Error("Invalid regex");
}
return new RegExp(pattern, flags);
});
const mdl = await import("../../index.js");
if (!mdl) {
throw new Error("Module not found");
}
isbotInstance = mdl.isbot as ReturnType<typeof createIsbot>;
});
afterAll(() => {
jest.restoreAllMocks();
});
test("fallback regex detects common crawlers", () => {
USER_AGENT_COMMON.forEach((ua) => {
if (!isbotInstance(ua)) {
throw new Error(`Failed to detect ${ua} as bot`);
}
});
});
test("fallback detects gotchas as bots", () => {
USER_AGENT_GOTCHAS.forEach((ua) => {
if (!isbotInstance(ua)) {
throw new Error(`Failed to detect ${ua} as bot (gotcha)`);
}
});
});
test("fallback does not detect browser as bot", () => {
expect(isbotInstance(BROWSER_USER_AGENT_EXAMPLE)).toBe(false);
});
});

describe("fixtures", () => {
test(`✔︎ ${crawlers.length} user agent string should be recognised as crawler`, () => {
let successCount = 0;
@@ -107,4 +179,17 @@ describe("isbot", () => {
expect(successCount).toBe(browsers.length);
});
});

describe("module interface", () => {
test("interface is as expected", async () => {
const types = Object.entries(await import("../../src/index")).map(
([key, value]) => [key, value.constructor.name] as [string, string],
);
expect(types).toMatchSnapshot();
});
test("regular expressions exports are as expected", () => {
expect(pattern).toBe(regularExpression);
expect(new RegExp(fullPattern, "i").toString()).toBe(pattern.toString());
});
});
});