Generate fake data with R

August 3, 2022
R
programming, data

There are times when you might need to generate some fake, synthetic data.

This might be for a demonstration, for testing, or in cases where the real data should not be touched, such as when the data is highly sensitive.

Fortunately, there are a bunch of handy packages available in R to help with generating fake data, including:

randNames
charlatan
generator
fakir

Coverage #

Here’s a table reflecting the types of values available from randNames, charlatan, and generator. I’ve left off fakir for now because the data types it generates are fundamentally quite different. The fakir section below includes some demonstrations of the types of data it does generate.

Feature randNames charlatan generator
Color, hex Yes
Color, name Yes
Color, rgb Yes
Credit card, number Yes Yes
Credit card, provider Yes
Credit card, security code Yes
Date of Birth Yes Yes
DOI Yes
Email Yes Yes Yes
Gender Yes
Gene sequence Yes
Identifier, type Yes Yes
Identifier, value Yes
IP address Yes Yes
Location, City Yes
Location, coordinates Yes Yes
Location, Postcode Yes
Location, State Yes
Location, Street Yes
Login, md5 Yes
Login, Password Yes
Login, salt Yes
Login, sha1 Yes
Login, sha256 Yes
Login, Username Yes
Name, First Yes Yes Yes
Name, Last Yes Yes Yes
Nationality Yes
Occupation Yes
Phone Yes Yes Yes
Phone, Cell Yes
Picture, large Yes
Picture, medium Yes
Picture, thumbnail Yes
Registration Date Yes
Registration Duration Yes
Title Yes
URI/URL Yes

Note that the coverage table above involves some interpretation on my part. For example, Identifier, type (which is id.name in the raw naming format within randNames) refers to the type of identifier, such as a social security number. The corresponding Identifier, value (id.value) is the value of the identifier, such as a specific social security number.

In some other packages, that might simply appear under a field called SSN. I’ve made the decision in this particular example to adopt the randNames field and value conventions around the identifier.

randNames #

randNames leverages the Random Names API, and at the moment, is my preferred way of generating synthetic data.

I like it for its very tidy approach to structuring data.

The set of values it offers includes:

gender  
email  
registered.date
registered.age
dob  
phone  
cell  
nat  
name.title  
name.first       
name.last  
location.street  
location.city  
location.state  
location.postcode  
login.username    
login.password  
login.salt  
login.md5  
login.sha1  
login.sha256  
id.name  
id.value  
picture.large  
picture.medium     
picture.thumbnail 

First, install and load the package.

# install.packages("randNames")
library("randNames")

To generate, say, 5 fake identities, run the following:

rand_names(5)
## # A tibble: 5 × 34
##   gender email   phone cell  nat   name.title name.first name.last location.city
##   <chr>  <chr>   <chr> <chr> <chr> <chr>      <chr>      <chr>     <chr>        
## 1 male   imre.f… 0394… 0170… DE    Mr         Imre       Fenzl     Dietfurt an …
## 2 female maia.p… (665… (595… NZ    Miss       Maia       Patel     Rotorua      
## 3 female marie.… 6158… 4828… DK    Ms         Marie      Johansen  Askeby       
## 4 male   visesl… 011-… 068-… RS    Mr         Višeslav   Zeljković Varvarin     
## 5 female amelia… 04-0… 06-6… FR    Miss       Amelia     Rey       Tourcoing    
## # ℹ 25 more variables: location.state <chr>, location.country <chr>,
## #   location.postcode <int>, location.street.number <int>,
## #   location.street.name <chr>, location.coordinates.latitude <chr>,
## #   location.coordinates.longitude <chr>, location.timezone.offset <chr>,
## #   location.timezone.description <chr>, login.uuid <chr>,
## #   login.username <chr>, login.password <chr>, login.salt <chr>,
## #   login.md5 <chr>, login.sha1 <chr>, login.sha256 <chr>, dob.date <chr>, …

The set of returned values can be tweaked based on gender (male or female) and nationality (AU, BR, CA, CH, DE, DK, ES, FI, FR, GB, IE, IR, NL, NZ, TR, US).

To generate a set of 10 French female names and emails for instance, the code would be:

library(dplyr)  # provides %>% and select()
rand_names(10, nationality = "FR", gender = "female") %>%
  select(name.first, name.last, email)
## # A tibble: 10 × 3
##    name.first name.last  email                         
##    <chr>      <chr>      <chr>                         
##  1 Kelya      Bernard    kelya.bernard@example.com     
##  2 Olivia     Roy        olivia.roy@example.com        
##  3 Lily       Gerard     lily.gerard@example.com       
##  4 Lison      Guillot    lison.guillot@example.com     
##  5 Coline     Guerin     coline.guerin@example.com     
##  6 Axelle     Leclercq   axelle.leclercq@example.com   
##  7 Emmie      Lecomte    emmie.lecomte@example.com     
##  8 Lola       Jean       lola.jean@example.com         
##  9 Anaëlle    Carpentier anaelle.carpentier@example.com
## 10 Margaux    Roche      margaux.roche@example.com

If you need the same random values from one run to the next, you can supply an arbitrary seed value.

rand_names(10, nationality = "FR", gender = "female", seed = "croissant") %>%
  select(name.first, name.last, email)
## # A tibble: 10 × 3
##    name.first name.last email                        
##    <chr>      <chr>     <chr>                        
##  1 Léa        Louis     lea.louis@example.com        
##  2 Bérénice   Morel     berenice.morel@example.com   
##  3 Laly       Gaillard  laly.gaillard@example.com    
##  4 Charlotte  Guillot   charlotte.guillot@example.com
##  5 Agathe     Meyer     agathe.meyer@example.com     
##  6 Aubin      Perrin    aubin.perrin@example.com     
##  7 Noah       Meunier   noah.meunier@example.com     
##  8 Alessio    Charles   alessio.charles@example.com  
##  9 Amelia     Clement   amelia.clement@example.com   
## 10 Armand     Marchand  armand.marchand@example.com
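A quick way to check that a seed really does pin the results is to make the same call twice and compare. This is only a minimal sketch: it assumes network access to the API, and the object names first_run and second_run are my own for illustration.

```r
library(randNames)

# Make the same seeded request twice (each call queries the remote API)
first_run  <- rand_names(5, seed = "croissant")
second_run <- rand_names(5, seed = "croissant")

# Compare the generated first names from the two runs;
# this should be TRUE if the seed reproduces the same records
same_names <- identical(first_run$name.first, second_run$name.first)
same_names
```

If the comparison comes back FALSE, it is worth checking whether the package or the API behind it has changed since this was written.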

charlatan #

The values available through charlatan include:

person names
jobs
phone numbers
colors: names, hex, rgb
credit cards
DOIs
numbers in range and from distributions
gene sequences
geographic coordinates
emails
URIs, URLs, and their parts
IP addresses

Installation and loading:

# install.packages("charlatan")
library("charlatan")

Generating a set of names:

ch_name(n = 5)
## [1] "Terese McKenzie"      "Rossie Schneider V"   "Adeline Hammes DVM"  
## [4] "Marlon Lang"          "Donal Ruecker-Renner"

Generating a set of occupations:

ch_job(n = 5)
## [1] "Engineer, production"            "Management consultant"          
## [3] "Geophysicist/field seismologist" "Social researcher"              
## [5] "Fine artist"

It’s also possible to specify a locale, from a set that includes fr_FR, fr_CH, hr_HR, fa_IR, pl_PL, ru_RU, uk_UA, and zh_TW.

ch_job(n = 5, locale = "fr_FR")
## [1] "Façadier"                            
## [2] "Accompagnateur de moyenne montagne"  
## [3] "Coffreur"                            
## [4] "Enseignant d'art"                    
## [5] "Gestionnaire de contrats d'assurance"

Generating a set of credit card numbers:

ch_credit_card_number(n = 5)
## [1] "6011927283694711484" "4278016663256"       "676354915893769"    
## [4] "3484873615860482"    "501806477997933"

Generating a dataset, including name, occupation, and phone number:

ch_generate(n = 5)
## # A tibble: 5 × 3
##   name                 job                               phone_number      
##   <chr>                <chr>                             <chr>             
## 1 Dr. Major Prosacco V Special educational needs teacher (862)423-5517     
## 2 Gil Ritchie-Kutch    Dancer                            (353)112-8325x6184
## 3 Laurel Sauer         Engineer, chemical                648-525-4045x65747
## 4 Debby Lang-Yost      Database administrator            04068700943       
## 5 Maritza Barton       Volunteer coordinator             +96(2)8298800855
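If only some of those columns are needed, ch_generate() also accepts the column names as its leading arguments. Here is a small sketch that keeps just the occupation and phone number; the object name contacts is my own, and the values will differ on every run because they are random.

```r
library(charlatan)

# Ask only for the job and phone_number columns, three rows
contacts <- ch_generate("job", "phone_number", n = 3)

# Inspect which columns came back and how many rows
names(contacts)
nrow(contacts)
```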

generator #

generator creates fake personally identifiable information, including:

Full name
E-mail address
Date of birth
Telephone number
Latitude and longitude
National identification number
IP address
Credit card number

The package hasn’t been updated for several years, and provides a set of data that is a subset of what some of the other mentioned packages provide. For these reasons, I probably wouldn’t rely on generator.

# install.packages("generator")
library(generator)
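
For completeness, here is a rough sketch of how a few of the generator functions can be assembled into a data frame. The r_*() functions are from the package as I understand it; the data frame layout and the column names are my own choices for illustration.

```r
library(generator)

n <- 5

# Each r_*() call returns a vector of n fake values;
# bind them together column by column
fake_people <- data.frame(
  name  = r_full_names(n),
  email = r_email_addresses(n),
  dob   = r_date_of_births(n),
  phone = r_phone_numbers(n),
  stringsAsFactors = FALSE
)

fake_people
```

Because each column is generated independently, the email addresses won't correspond to the names in the same row.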

fakir #

The fakir package is at an early stage of development, but seems promising and creates synthetic records of types that are fundamentally different from the other packages mentioned so far.

It seems that the authors are French, so there are some French terms peppered throughout. For instance, sondage (“survey”) and nom (“name”) appear in some of the results.

Here are some of the main functions, pulled from the help documentation for fakir:

fake_base_clients
fake_products
fake_sondage_answers
fake_sondage_people
fake_ticket_client
fake_user_feedback
fake_visits
fra_sf

# devtools::install_github("ThinkR-open/fakir")
library(fakir)

Fake clients:

fake_base_clients(n = 5)
## # A tibble: 5 × 14
##   num_client first last  job     age region id_dpt departement cb_provider name 
## * <chr>      <chr> <chr> <chr> <dbl> <chr>  <chr>  <chr>       <chr>       <chr>
## 1 1          Carl… Cass… Stag…    53 Midi-… 65     Hautes-Pyr… VISA 13 di… Carl…
## 2 2          Leer… Beat… Clot…    85 Midi-… 65     Hautes-Pyr… VISA 16 di… Leer…
## 3 3          Dr.   Tala… Acco…    27 Rhône… 07     Ardèche     Diners Clu… Dr. …
## 4 4          Sier… Hett… Scie…    33 Basse… 61     Orne        JCB 16 dig… Sier…
## 5 5          Ammon Dick… Spec…    40 Lorra… 88     Vosges      Mastercard  Ammo…
## # ℹ 4 more variables: entry_date <dttm>, fidelity_points <dbl>,
## #   priority_encoded <dbl>, priority <fct>

Fake products:

fake_products(n = 5)
## # A tibble: 5 × 8
##   name                  brand color price body_location category sent_from    id
##   <chr>                 <chr> <chr> <int> <chr>         <chr>    <chr>     <int>
## 1 Step and Distance Pe… Beer… Sadd…     2 Waist         Enterta… Taiwan        1
## 2 Biking Tracker U Pro… Gerl… Hone…     2 Torso         Medical  Japan         2
## 3 Wearable Transmitter… Beer… Medi…     2 Chest         Gaming   China         3
## 4 Multifunction Tracke… Scha… DimG…     5 Brain         Gaming   France        4
## 5 Action Camera Pro     Huds… Salm…     4 Chest         Pets an… Netherla…     5

Fake survey responses:

fake_sondage_answers(n = 5)
## # A tibble: 15 × 12
##    id_individu   age sexe  region                 id_departement nom_departement
##    <chr>       <int> <chr> <chr>                  <chr>          <chr>          
##  1 ID-RJXN-02     53 F     Bourgogne              71             Saône-et-Loire 
##  2 ID-RJXN-02     53 F     Bourgogne              71             Saône-et-Loire 
##  3 ID-RJXN-02     53 F     Bourgogne              71             Saône-et-Loire 
##  4 ID-VMKS-04     90 F     Provence-Alpes-Côte d… 13             <NA>           
##  5 ID-VMKS-04     90 F     Provence-Alpes-Côte d… 13             <NA>           
##  6 ID-VMKS-04     90 F     Provence-Alpes-Côte d… 13             <NA>           
##  7 ID-XEMZ-03     84 O     Auvergne               43             Haute-Loire    
##  8 ID-XEMZ-03     84 O     Auvergne               43             Haute-Loire    
##  9 ID-XEMZ-03     84 O     Auvergne               43             Haute-Loire    
## 10 ID-EUDQ-05     65 M     Picardie               80             Somme          
## 11 ID-EUDQ-05     65 M     Picardie               80             Somme          
## 12 ID-EUDQ-05     65 M     Picardie               80             Somme          
## 13 ID-NMQG-01     60 O     Picardie               60             Oise           
## 14 ID-NMQG-01     60 O     Picardie               60             Oise           
## 15 ID-NMQG-01     60 O     Picardie               60             Oise           
## # ℹ 6 more variables: question_date <dttm>, year <dbl>, type <chr>,
## #   distance_km <dbl>, transport <fct>, temps_trajet_en_heures <dbl>

fake_sondage_people(n = 5)
## # A tibble: 5 × 8
##   id_individu   age sexe  region                  id_departement nom_departement
##   <chr>       <int> <chr> <chr>                   <chr>          <chr>          
## 1 ID-RJXN-02     53 F     Nord-Pas-de-Calais      62             Pas-de-Calais  
## 2 ID-VMKS-04     90 F     Aquitaine               47             Lot-et-Garonne 
## 3 ID-XEMZ-03     84 O     Provence-Alpes-Côte d'… 05             <NA>           
## 4 ID-EUDQ-05     65 M     Centre                  37             Indre-et-Loire 
## 5 ID-NMQG-01     60 O     Haute-Normandie         76             Seine-Maritime 
## # ℹ 2 more variables: question_date <dttm>, year <dbl>

Parting thoughts #

None of these data generation packages is perfect or complete. Depending on the use case, it might be necessary to combine several packages to arrive at whatever end state you’re aiming for.
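
As a rough illustration of mixing packages, the sketch below takes names and occupations from charlatan and fills in dates of birth and IP addresses from generator. The column names and the choice of which package supplies which field are my own assumptions, not a recommended recipe.

```r
library(charlatan)
library(generator)

n <- 5

# Names and jobs come from charlatan; dates of birth and
# IP addresses come from generator
combined <- data.frame(
  name = ch_name(n = n),
  job  = ch_job(n = n),
  dob  = r_date_of_births(n),
  ip   = r_ipv4_addresses(n),
  stringsAsFactors = FALSE
)

combined
```

Each column is generated independently, so nothing ties a given name to its job or date of birth beyond sharing a row. That is usually fine for demos and tests, but worth keeping in mind if the relationships between fields matter.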