Skip to content

Latest commit

 

History

History
412 lines (351 loc) · 14.5 KB

README.md

File metadata and controls

412 lines (351 loc) · 14.5 KB

stylom - A command line utility for stylometric text analysis

1 Description

This is a tool to facilitate the stylometric analysis of texts. It could be used for academic disambiguation of disputed authorship, and to help identify plagiarists, astroturfers, sockpuppets and guerilla marketers. Another possible use case is as an assistance to the anonymisation of writing style.

Every author has their own unique writing style, and if enough writing examples are available then it's possible to construct a quantitative model of their style which can be compared against others.

2 Installation

The easiest way to install is from a pre-compiled package, a number of which are available at:

https://build.opensuse.org/project/show?project=home%3Amotters%3Astylom

But if you prefer not to get involved with binaries then this program is pretty easy to compile and install as follows:

make
sudo make install

If you wish to generate a Debian package you can also run the following script:

./debian.sh

To plot graphs you will need to have gnuplot installed. For example:

sudo apt-get install gnuplot

3 Generating Author Fingerprints

Fingerprints are high dimensional vectors which represent the writing style of particular authors. To generate a fingerprint first gather some examples of the author's writing within plain text files and put them in a directory. Then run something similar to the following:

cat texts/dickens/* | stylom -n "Charles Dickens" -f > fingerprints/dickens.style

Here the texts get piped into the command and the resulting fingerprint is then saved to a file. Ideally the amount of example text should be as large as possible.

4 Matching Unknown Text

If you have a file containing a sample of text for which the author is unknown, but who is likely to be an author for which you have previously generated a fingerprint, you can find the most likely candidate in the following way:

cat unknown.txt | stylom --match fingerprints

Where fingerprints is the name of a directory containing previously saved fingerprints for a variety of possible authors. This will return a single name, but it's also possible to return a list of candidates in the following way.

cat unknown.txt | stylom --list fingerprints

5 Counter-Stylometry

Stylometrics have various legitimate uses in terms of the study of disputed historical texts or detecting plagiarism. However, a possible danger of stylometric methods is their abuse by powerful organisations against bloggers or whistleblowers who may be trying to raise issues of public concern in an anonymous manner in order to avoid serious reprisals.

A known method to render stylometry less effective in determining the identity of an author is to try to make the style of writing as similar as possible to some existing author, hence obscuring the differences. Automatically transforming the writing style of a given text into the writing style of a known author is a very difficult and likely AI-complete problem, but this program can be used provide the writer with advice on how to alter their work to achieve a greater degree of similarity.

cat mytext.txt | stylom -c fingerprints/charles_dickens.style

or

stylom -t mytext.txt -c fingerprints/charles_dickens.style

Takes some text and compares it to a pre-computed fingerprint for Charles Dickens. It then gives some indication of how to alter mytext.txt in order to make it more similar to Dickens' writing style. The number of differences is also shown, which can be used as an indicator of progress.

If you are only interested in which words are present in mytext.txt but missing from Dickens writing:

cat mytext.txt | stylom --missing fingerprints/charles_dickens.style

or

stylom -t mytext.txt --missing fingerprints/charles_dickens.style

The results could then be used by some other program (maybe highlighted in a text editor GUI). The same can also be done for words which are more frequent within the fingerprint or less frequent within the fingerprint.

cat mytext.txt | stylom --more fingerprints/charles_dickens.style

or

cat mytext.txt | stylom --less fingerprints/charles_dickens.style

The above methods can be used to make the vocabulary and the word frequency similar to an existing known author. It's not a perfect solution, since it takes no account of syntax, and it still involves mannual editing effort by the writer.

6 Plotting Text Distributions

You can plot the distribution of fingerprints in the following way.

stylom --plot <directory>

This plots any fingerprints within the given directory within a 2D graph so that you can see how authors are distributed. The graph is saved by default with the filename result.png

If necessary you can also specify a title for the graph.

stylom --title "My Graph Title" --plot <directory>

You can also plot texts more directly without having previously calculated fingerprints for them.

stylom --title "My Texts" --plottexts <directory>

7 Options

7.1 -t,–text <string or filename>

This can be either text to be analysed or a filename containing plain text.

7.2 -n,–name <author>

The name of the author.

7.3 -f,–fingerprint

Displays a fingerprint on the console in XML format.

7.4 -m,–match <directory>

Match the given fingerprint against ones stored within a directory.

7.5 -l,–list <directory>

Match the given fingerprint against ones stored within a directory and show a list of the most similar authors.

7.6 -c,–compare <fingerprint>

Compares the current text against the given fingerprint and reports differences.

7.7 –missing <fingerprint>

Report words which are present in the current text but which are missing from the given fingerprint.

7.8 –more <fingerprint>

Report words which are present more frequently in the given fingerprint than in the current text.

7.9 –less <fingerprint>

Report words which are present less frequently in the given fingerprint than in the current text.

7.10 -p,–plot <directory>

Plots fingerprints within the given directory using gnuplot. The resulting image is saved as result.png

7.11 –plottexts <directory>

Plots texts within the given directory using gnuplot. The resulting image is saved as result.png

7.12 –title <text>

Specify a title for use together with the –plot option.

7.13 -v,–version

Show the version number.

7.14 -h,–help

Show help.

8 Bugs

9 Author

Bob Mottram <[email protected]>

GPG ID: 0xEA982E38

GPG Fingerprint: D538 1159 CD7A 2F80 2F06 ABA0 0452 CC7C EA98 2E38

http://robotics.uk.to

Author: Bob Mottram

Created: 2013-06-10 Mon 15:48

Emacs 24.2.1 (Org mode 8.0.3)

Validate XHTML 1.0