Downloading and anonymizing archived releases

With the Release Database Writer tool, you can make changes to your releases at a database level, anonymize your installation, or download archived releases as JSON files. This topic describes how to use the Database Writer to download and anonymize your archived releases.

Data hidden in the process

The Database Writer connects to the Release archiving database and downloads each release as a JSON file to a specified folder. The tool scans the contents of each release and replaces sensitive information, such as user names, with placeholders. The tool searches for the following items:

  • emails (john.smith@my-org.com)
  • people names (John, Smith)
  • user names (jsmith)
  • organization names (my-org, MyOrg, My Org)
  • location names (countries)
  • telephone numbers (+1 …)
  • mentions (@j-smith)

The tool finds this information using regular expressions or dictionaries that are stored as text files in the tool distribution. You can find and customize the dictionaries inside the database-writer-<version>.jar archive under BOOT-INF/classes/dictionary/. Each item is detected as follows:

  • emails: matched with a regular expression
  • people names: listed in the first_names.txt and surnames.txt files
  • user names: loaded from the Release users table
  • organization names or other additional names: supplied as a parameter when running the tool
  • location names: listed in the geolocations.txt file
  • telephone numbers: matched with regular expressions selected based on the locale specified when running the tool
  • mentions: matched with a regular expression

There is an additional file named ignore.txt that contains a list of exclusions: strings that must not be replaced. For example, the word release appears throughout the release JSON files, so replacing it could make the JSON lose its meaning.
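
The following Python sketch illustrates the general idea of combining a regular expression, a dictionary, and an ignore list. It is not the tool's implementation: the email pattern, the anonymize function, and the placeholder numbering are assumptions made for illustration only.

```python
import re

# Simplified email pattern (an assumption; the tool's own expressions may differ).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text, names, ignore):
    """Replace emails and dictionary names with numbered placeholders.

    Returns the anonymized text and a placeholder -> original mapping.
    """
    mapping = {}   # placeholder -> original string
    assigned = {}  # original string -> placeholder, so repeats reuse one placeholder

    def placeholder(kind, original):
        if original not in assigned:
            key = f"{kind}_{len(assigned) + 1}"
            assigned[original] = key
            mapping[key] = original
        return f"_{assigned[original]}_"

    # 1. Regular-expression pass: emails.
    text = EMAIL_RE.sub(lambda m: placeholder("EMAIL", m.group(0)), text)

    # 2. Dictionary pass: whole-word, case-sensitive matches; ignored words are skipped.
    for name in sorted(names, key=len, reverse=True):
        if name in ignore:
            continue
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        text = pattern.sub(lambda m, n=name: placeholder("NAME", n), text)

    return text, mapping


sample = "Release owned by John (john.smith@my-org.com)"
result, mapping = anonymize(sample, names={"John", "Release"}, ignore={"Release"})
print(result)
print(mapping)
```

In this sketch the word Release stays untouched because it is on the ignore list, even though it also appears in the names dictionary.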

Installation

You can download the latest version of the database-writer tool from the customer download area.

  1. Unpack database-writer-*.zip into a new directory database-writer.
  2. If you are using the repository keystore, copy XLR_HOME/conf/repository-keystore.jceks to database-writer/repository-keystore.jceks.
  3. Create a new file database-writer/app.properties with the following content:
# Connection details for the Release reporting database
xlr.datasource.url=jdbc:mysql://localhost/xlarchive?useSSL=false
xlr.datasource.username=xlarchive
xlr.datasource.password=password
xlr.datasource.driver-class-name=com.mysql.jdbc.Driver

# If you have a repository keystore file with a password then you should specify the following two properties:
repository.keystore.location=./repository-keystore.jceks
repository.keystore.password=thepassword
  4. Copy the JDBC driver of the archive database to database-writer/lib/.
  5. Optionally, create a file with extra names of organizations or people that you want to make sure are replaced. Place one string per line. A string can contain multiple words; the tool replaces the whole string, and matching is case-sensitive.

Example: a file ./names.txt with the following content:

ACME
ACME Inc.
acme
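
Because matching is case-sensitive, each spelling you want replaced needs its own line. The Python sketch below shows one possible way to generate such a file. The helper names case_variants and write_names_file are hypothetical and are not part of the tool.

```python
from pathlib import Path


def case_variants(name):
    """Return the name as given, upper-cased, lower-cased, and title-cased, without duplicates."""
    variants = [name, name.upper(), name.lower(), name.title()]
    return list(dict.fromkeys(variants))  # keeps order, drops duplicates


def write_names_file(path, names):
    """Write one replacement string per line and return the lines written."""
    lines = []
    for name in names:
        lines.extend(case_variants(name))
    unique = list(dict.fromkeys(lines))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

For example, calling write_names_file("./names.txt", ["ACME", "ACME Inc."]) would write ACME, acme, and Acme, followed by the variants of ACME Inc. Review the generated file before using it, because automatically derived variants can include common words that you do not want replaced.
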
  1. To start the tool, execute:

    ./bin/database-writer --spring.config.location=app.properties
  2. To download the anonymized archived releases, execute the following command:

    read_all_archived --path ./archived_releases --anonymize --locale US --additional-replacements-path ./names.txt --anonymization-output-path ./replacements-made.txt --skip-existing
  3. The anonymization process can take several minutes and is CPU-intensive. When it finishes, all the archived releases are present as JSON files in the specified directory (./archived_releases/**/*.json in this example).

The file ./replacements-made.txt contains a mapping of the placeholders to the strings they replaced. This file is there for you to verify the contents and must not be shared. For example:

EMAIL_4: jsmith@my-org.com
...
MENTION_9: @j-smith
...
NAME_7: John
...
ORG_1: My Org
...
USERNAME_5: jsmith
...
TELEPHONE_6: 2122620703

In releases where the specified user was mentioned, you will see placeholders such as _EMAIL_4_ or _NAME_7_.
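
One way to use the mapping file for verification is to check whether any original string still appears in the downloaded JSON files. The Python sketch below assumes that each mapping line has the form PLACEHOLDER: original, as in the example above. The function names are hypothetical and not part of the tool.

```python
from pathlib import Path


def load_replacements(mapping_path):
    """Parse lines of the form 'PLACEHOLDER: original' into a dict."""
    replacements = {}
    for line in Path(mapping_path).read_text(encoding="utf-8").splitlines():
        placeholder, sep, original = line.partition(": ")
        # Lines without the separator (or without a value) are skipped.
        if sep and original.strip():
            replacements[placeholder.strip()] = original.strip()
    return replacements


def find_leftovers(releases_dir, replacements):
    """Return (file, original) pairs where an original string still appears."""
    leftovers = []
    for json_file in sorted(Path(releases_dir).rglob("*.json")):
        content = json_file.read_text(encoding="utf-8")
        for original in replacements.values():
            if original in content:
                leftovers.append((str(json_file), original))
    return leftovers
```

A call such as find_leftovers("./archived_releases", load_replacements("./replacements-made.txt")) returns an empty list when no original string is found. This is a plain substring check, so short names can produce false positives when they occur inside longer, unrelated words.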

To run the tool without the interactive shell:

  1. Create a file with the command line, e.g. ./command.cli:

    read_all_archived --path ./archived_releases --anonymize --locale US --additional-replacements-path ./names.txt --anonymization-output-path ./replacements-made.txt --skip-existing
  2. Run the tool:

    ./bin/database-writer --spring.config.location=app.properties @command.cli

After the tool finishes, the archived releases are available in the specified folder. You can package them and use them in a testing environment.
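
As a minimal sketch of the packaging step, the Python function below collects only the JSON files into a gzip-compressed tar archive, so that a mapping file such as replacements-made.txt is not included by accident. The function name pack_releases and the archive name in the usage note are assumptions for illustration.

```python
import tarfile
from pathlib import Path


def pack_releases(releases_dir, archive_path):
    """Create a .tar.gz archive containing only the JSON files; return the file count."""
    releases_dir = Path(releases_dir)
    count = 0
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for json_file in sorted(releases_dir.rglob("*.json")):
            # Store each file under its path relative to the releases directory.
            relative_name = json_file.relative_to(releases_dir).as_posix()
            archive.add(str(json_file), arcname=relative_name)
            count += 1
    return count
```

For example, pack_releases("./archived_releases", "./archived_releases.tar.gz") would produce an archive that can be copied to the testing environment.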