Generate fake data with R
August 3, 2022
There are times when you might need to generate some fake, synthetic data.
This might be for a demonstration, for testing, or in cases where the real data should not be touched, such as when the data is highly sensitive.
Fortunately, R has several handy packages for generating fake data, including randNames, charlatan, generator, and fakir.
Coverage #
Here’s a table reflecting the types of values available from randNames, charlatan, and generator. I’ve left off fakir for now because the data types generated are fundamentally quite different. The fakir section includes some demonstrations of the types of data that do get generated.
Feature | randNames | charlatan | generator |
---|---|---|---|
Color, hex | | Yes | |
Color, name | | Yes | |
Color, rgb | | Yes | |
Credit card, number | | Yes | Yes |
Credit card, provider | | Yes | |
Credit card, security code | | Yes | |
Date of Birth | Yes | | Yes |
DOI | | Yes | |
Email | Yes | Yes | Yes |
Gender | Yes | | |
Gene sequence | | Yes | |
Identifier, type | Yes | | Yes |
Identifier, value | Yes | | |
IP address | | Yes | Yes |
Location, City | Yes | | |
Location, coordinates | | Yes | Yes |
Location, Postcode | Yes | | |
Location, State | Yes | | |
Location, Street | Yes | | |
Login, md5 | Yes | | |
Login, Password | Yes | | |
Login, salt | Yes | | |
Login, sha1 | Yes | | |
Login, sha256 | Yes | | |
Login, Username | Yes | | |
Name, First | Yes | Yes | Yes |
Name, Last | Yes | Yes | Yes |
Nationality | Yes | | |
Occupation | | Yes | |
Phone | Yes | Yes | Yes |
Phone, Cell | Yes | | |
Picture, large | Yes | | |
Picture, medium | Yes | | |
Picture, thumbnail | Yes | | |
Registration Date | Yes | | |
Registration Duration | Yes | | |
Title | Yes | | |
URI/URL | | Yes | |
Note that there’s some interpretation in how I’ve presented the coverage table above. For example, Identifier, type (which is id.name in the raw naming format within randNames) refers to the type of identifier, such as a social security number. The corresponding Identifier, value (id.value) is the value of the identifier, such as a specific social security number.
In some other packages, that might simply appear under a field called SSN. I’ve made the decision in this particular example to adopt the randNames field and value conventions around the identifier.
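To make the type/value convention concrete, here is a minimal sketch in base R. The data frame is a hand-made stand-in for identifier fields, and its values are invented placeholders rather than real API results. The ssn column name is my own choice for illustration.

```r
# Hand-made stand-in for the identifier fields (placeholder values only)
ids <- data.frame(
  id.name  = c("SSN", "NINO", "SSN"),
  id.value = c("000-00-0001", "AA 00 00 00 A", "000-00-0002"),
  stringsAsFactors = FALSE
)

# Collapse the type/value pair into a single SSN-style column:
# keep the value where the identifier type is "SSN", otherwise NA
ids$ssn <- ifelse(ids$id.name == "SSN", ids$id.value, NA_character_)
ids
```

Going the other way (from a single SSN field to a type/value pair) is just a matter of adding a constant type column.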
randNames #
randNames leverages the Random User Generator API (randomuser.me) and, at the moment, is my preferred way of generating synthetic data.
I like it for its very tidy approach to structuring data.
The set of values it offers includes:
gender
email
registered.date
registered.age
dob
phone
cell
nat
name.title
name.first
name.last
location.street
location.city
location.state
location.postcode
login.username
login.password
login.salt
login.md5
login.sha1
login.sha256
id.name
id.value
picture.large
picture.medium
picture.thumbnail
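First, install and load the package. Some of the examples further down also use dplyr for the pipe and select().
# install.packages("randNames")
library("randNames")
library("dplyr") # provides %>% and select()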
To generate, say, 5 fake identities, run the following:
rand_names(5)
## # A tibble: 5 × 34
## gender email phone cell nat name.title name.first name.last location.city
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 male imre.f… 0394… 0170… DE Mr Imre Fenzl Dietfurt an …
## 2 female maia.p… (665… (595… NZ Miss Maia Patel Rotorua
## 3 female marie.… 6158… 4828… DK Ms Marie Johansen Askeby
## 4 male visesl… 011-… 068-… RS Mr Višeslav Zeljković Varvarin
## 5 female amelia… 04-0… 06-6… FR Miss Amelia Rey Tourcoing
## # ℹ 25 more variables: location.state <chr>, location.country <chr>,
## # location.postcode <int>, location.street.number <int>,
## # location.street.name <chr>, location.coordinates.latitude <chr>,
## # location.coordinates.longitude <chr>, location.timezone.offset <chr>,
## # location.timezone.description <chr>, login.uuid <chr>,
## # login.username <chr>, login.password <chr>, login.salt <chr>,
## # login.md5 <chr>, login.sha1 <chr>, login.sha256 <chr>, dob.date <chr>, …
The set of returned values can be tweaked based on gender (male or female) and nationality (AU, BR, CA, CH, DE, DK, ES, FI, FR, GB, IE, IR, NL, NZ, TR, US).
To generate a set of 10 French female names and emails, for instance, the code would be:
rand_names(10, nationality = "FR", gender = "female") %>%
select(name.first, name.last, email)
## # A tibble: 10 × 3
## name.first name.last email
## <chr> <chr> <chr>
## 1 Kelya Bernard kelya.bernard@example.com
## 2 Olivia Roy olivia.roy@example.com
## 3 Lily Gerard lily.gerard@example.com
## 4 Lison Guillot lison.guillot@example.com
## 5 Coline Guerin coline.guerin@example.com
## 6 Axelle Leclercq axelle.leclercq@example.com
## 7 Emmie Lecomte emmie.lecomte@example.com
## 8 Lola Jean lola.jean@example.com
## 9 Anaëlle Carpentier anaelle.carpentier@example.com
## 10 Margaux Roche margaux.roche@example.com
If you need the same random values from one run to the next, you can define an arbitrary seed value.
rand_names(10, nationality = "FR", gender = "female", seed = "croissant") %>%
select(name.first, name.last, email)
## # A tibble: 10 × 3
## name.first name.last email
## <chr> <chr> <chr>
## 1 Léa Louis lea.louis@example.com
## 2 Bérénice Morel berenice.morel@example.com
## 3 Laly Gaillard laly.gaillard@example.com
## 4 Charlotte Guillot charlotte.guillot@example.com
## 5 Agathe Meyer agathe.meyer@example.com
## 6 Aubin Perrin aubin.perrin@example.com
## 7 Noah Meunier noah.meunier@example.com
## 8 Alessio Charles alessio.charles@example.com
## 9 Amelia Clement amelia.clement@example.com
## 10 Armand Marchand armand.marchand@example.com
charlatan #
The values available through charlatan include:
person names
jobs
phone numbers
colors: names, hex, rgb
credit cards
DOIs
numbers in range and from distributions
gene sequences
geographic coordinates
emails
URIs, URLs, and their parts
IP addresses
Installation and loading:
# install.packages("charlatan")
library("charlatan")
Generating a set of names:
ch_name(n = 5)
## [1] "Terese McKenzie" "Rossie Schneider V" "Adeline Hammes DVM"
## [4] "Marlon Lang" "Donal Ruecker-Renner"
Generating a set of occupations:
ch_job(n = 5)
## [1] "Engineer, production" "Management consultant"
## [3] "Geophysicist/field seismologist" "Social researcher"
## [5] "Fine artist"
It’s also possible to specify a locale from the set fr_FR, fr_CH, hr_HR, fa_IR, pl_PL, ru_RU, uk_UA, zh_TW.
ch_job(n = 5, locale = "fr_FR")
## [1] "Façadier"
## [2] "Accompagnateur de moyenne montagne"
## [3] "Coffreur"
## [4] "Enseignant d'art"
## [5] "Gestionnaire de contrats d'assurance"
Generating a set of credit card numbers:
ch_credit_card_number(n = 5)
## [1] "6011927283694711484" "4278016663256" "676354915893769"
## [4] "3484873615860482" "501806477997933"
Generating a dataset, including name, occupation, and phone number:
ch_generate(n = 5)
## # A tibble: 5 × 3
## name job phone_number
## <chr> <chr> <chr>
## 1 Dr. Major Prosacco V Special educational needs teacher (862)423-5517
## 2 Gil Ritchie-Kutch Dancer (353)112-8325x6184
## 3 Laurel Sauer Engineer, chemical 648-525-4045x65747
## 4 Debby Lang-Yost Database administrator 04068700943
## 5 Maritza Barton Volunteer coordinator +96(2)8298800855
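If the default columns aren't the ones you need, one option is to assemble your own data frame from the individual ch_* functions shown above. This is only a sketch: the people object and its column names are my own choices, and the values will differ on every run.

```r
library(charlatan)

n <- 5

# Build a custom dataset column by column from the generators used earlier
people <- data.frame(
  name        = ch_name(n = n),
  job         = ch_job(n = n),
  credit_card = ch_credit_card_number(n = n),
  stringsAsFactors = FALSE
)
people
```

Because each column comes from a separate call, the columns are independent of one another, which is usually fine for demo or test data.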
generator #
generator creates fake personally identifiable information, including:
Full name
E-mail address
Date of birth
Telephone number
Latitude and longitude
National identification number
IP address
Credit card number
The package hasn’t been updated for several years, and provides a set of data that is a subset of what some of the other mentioned packages provide. For these reasons, I probably wouldn’t rely on generator.
# install.packages("generator")
library(generator)
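For completeness, here is a rough sketch of what usage looks like. The r_* function names below are as I recall them from the generator documentation, so treat them as assumptions and check the package help if they don't match your installed version. The fake_pii name is just for illustration.

```r
library(generator)

n <- 5

# Each r_* function is assumed to return a vector of n fake values
fake_pii <- data.frame(
  name  = r_full_names(n),
  email = r_email_addresses(n),
  dob   = r_date_of_births(n),
  phone = r_phone_numbers(n),
  stringsAsFactors = FALSE
)
fake_pii
```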
fakir #
The fakir package is at an early stage of development, but seems promising and creates synthetic records of types that are fundamentally different from the other packages mentioned so far.
It seems that the authors are French, so there are some French terms peppered throughout. For instance, sondage (“survey”) and nom (“name”) appear in some of the results.
Here are some of the main functions, pulled from the help documentation for fakir:
fake_base_clients
fake_products
fake_sondage_answers
fake_sondage_people
fake_ticket_client
fake_user_feedback
fake_visits
fra_sf
# devtools::install_github("ThinkR-open/fakir")
library(fakir)
Fake clients:
fake_base_clients(n = 5)
## # A tibble: 5 × 14
## num_client first last job age region id_dpt departement cb_provider name
## * <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Carl… Cass… Stag… 53 Midi-… 65 Hautes-Pyr… VISA 13 di… Carl…
## 2 2 Leer… Beat… Clot… 85 Midi-… 65 Hautes-Pyr… VISA 16 di… Leer…
## 3 3 Dr. Tala… Acco… 27 Rhône… 07 Ardèche Diners Clu… Dr. …
## 4 4 Sier… Hett… Scie… 33 Basse… 61 Orne JCB 16 dig… Sier…
## 5 5 Ammon Dick… Spec… 40 Lorra… 88 Vosges Mastercard Ammo…
## # ℹ 4 more variables: entry_date <dttm>, fidelity_points <dbl>,
## # priority_encoded <dbl>, priority <fct>
Fake products:
fake_products(n = 5)
## # A tibble: 5 × 8
## name brand color price body_location category sent_from id
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Step and Distance Pe… Beer… Sadd… 2 Waist Enterta… Taiwan 1
## 2 Biking Tracker U Pro… Gerl… Hone… 2 Torso Medical Japan 2
## 3 Wearable Transmitter… Beer… Medi… 2 Chest Gaming China 3
## 4 Multifunction Tracke… Scha… DimG… 5 Brain Gaming France 4
## 5 Action Camera Pro Huds… Salm… 4 Chest Pets an… Netherla… 5
Fake survey responses:
fake_sondage_answers(n = 5)
## # A tibble: 15 × 12
## id_individu age sexe region id_departement nom_departement
## <chr> <int> <chr> <chr> <chr> <chr>
## 1 ID-RJXN-02 53 F Bourgogne 71 Saône-et-Loire
## 2 ID-RJXN-02 53 F Bourgogne 71 Saône-et-Loire
## 3 ID-RJXN-02 53 F Bourgogne 71 Saône-et-Loire
## 4 ID-VMKS-04 90 F Provence-Alpes-Côte d… 13 <NA>
## 5 ID-VMKS-04 90 F Provence-Alpes-Côte d… 13 <NA>
## 6 ID-VMKS-04 90 F Provence-Alpes-Côte d… 13 <NA>
## 7 ID-XEMZ-03 84 O Auvergne 43 Haute-Loire
## 8 ID-XEMZ-03 84 O Auvergne 43 Haute-Loire
## 9 ID-XEMZ-03 84 O Auvergne 43 Haute-Loire
## 10 ID-EUDQ-05 65 M Picardie 80 Somme
## 11 ID-EUDQ-05 65 M Picardie 80 Somme
## 12 ID-EUDQ-05 65 M Picardie 80 Somme
## 13 ID-NMQG-01 60 O Picardie 60 Oise
## 14 ID-NMQG-01 60 O Picardie 60 Oise
## 15 ID-NMQG-01 60 O Picardie 60 Oise
## # ℹ 6 more variables: question_date <dttm>, year <dbl>, type <chr>,
## # distance_km <dbl>, transport <fct>, temps_trajet_en_heures <dbl>
Fake survey respondents:
fake_sondage_people(n = 5)
## # A tibble: 5 × 8
## id_individu age sexe region id_departement nom_departement
## <chr> <int> <chr> <chr> <chr> <chr>
## 1 ID-RJXN-02 53 F Nord-Pas-de-Calais 62 Pas-de-Calais
## 2 ID-VMKS-04 90 F Aquitaine 47 Lot-et-Garonne
## 3 ID-XEMZ-03 84 O Provence-Alpes-Côte d'… 05 <NA>
## 4 ID-EUDQ-05 65 M Centre 37 Indre-et-Loire
## 5 ID-NMQG-01 60 O Haute-Normandie 76 Seine-Maritime
## # ℹ 2 more variables: question_date <dttm>, year <dbl>
Parting thoughts #
None of these data generation packages is perfect or complete. Depending on the use case, it might be necessary to combine several packages to arrive at whatever end state you’re aiming for.