Title: | Privacy Preserving Record Linkage |
---|---|
Description: | A toolbox for deterministic, probabilistic and privacy-preserving record linkage techniques. Combines the functionality of the 'Merge ToolBox' (<https://www.record-linkage.de>) with current privacy-preserving techniques. |
Authors: | Rainer Schnell [aut, cph], Dorothea Rukasz [aut, cre], Christian Borgs [ctb], Stefan Brumme [ctb] (HMAC, SHA256), William B. Brogden [ctb] (Metaphone), Tim O'Brien [ctb] (Metaphone), Stephen Lacy [ctb] (Double Metaphone), Apache Software Foundation [cph] |
Maintainer: | Dorothea Rukasz <[email protected]> |
License: | GPL-3 |
Version: | 0.3.8 |
Built: | 2024-11-20 05:05:01 UTC |
Source: | https://github.com/cran/PPRL |
Compares all elements of two vectors of records with each other using Armknecht's and Schnell's methods "create" and "compare".
CompareAS16(IDA, dataA, IDB, dataB, password, t = 0.85)
IDA |
A character vector or integer vector containing the IDs of the first data.frame. |
dataA |
A character vector containing the bit vectors from which new bit vectors are created internally using Armknecht's method "create". |
IDB |
A character vector or integer vector containing the IDs of the second data.frame. |
dataB |
A character vector containing the bit vectors from which new bit vectors are created internally using Armknecht's method "create". |
password |
A string containing the password used in the method "create". |
t |
A float containing the lower Tanimoto similarity threshold. |
Two bit vectors generated by CreateAS16 are compared as described in the original publication.
The function returns a data.frame with four columns containing all ID-pairs of all bit vectors, the estimated Tanimoto similarity and the classification (links/non-links).
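For reference, the Tanimoto similarity on bit vectors is the ratio of shared one-bits to the total number of one-bits; a minimal sketch (not the package's internal code):

```r
# Tanimoto (Jaccard) similarity of two 0/1 vectors:
# shared one-bits divided by the union of one-bits.
tanimoto <- function(a, b) sum(a & b) / sum(a | b)

tanimoto(c(1, 0, 1, 1), c(1, 1, 1, 0))  # 0.5
```

Pairs whose similarity reaches the threshold t are classified as links.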
Armknecht, F., Schnell, R. (unpublished): Privacy Preserving Record Linkage Based on Bloom Filters and Codes. Working Paper.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Create Bloom filters
testData <- CreateBF(ID = testData$V1, testData$V7,
                     k = 20, padding = 1, q = 2, l = 1000,
                     password = "(H]$6Uh*-Z204q")

# Optional: create the new bit vectors. The output of this function is only
# meant for inspecting the created bit vectors; it is not the input of
# CompareAS16.
testAS <- CreateAS16(testData$ID, testData$CLKs, password = "khäuds")

# Compare bit vectors; the input is not the output of CreateAS16 but the
# original Bloom filters. CreateAS16 is executed inside CompareAS16.
res <- CompareAS16(testData$ID, testData$CLKs,
                   testData$ID, testData$CLKs,
                   password = "khäuds", t = 0.85)
Creates ESLs (also known as 581-Keys), which are the hashed combination of the full date of birth and sex and subsets of first and last names.
Create581(ID, data, code, password)
ID |
a character vector or integer vector containing the IDs of the data.frame. |
data |
a data.frame containing the data to be encoded. |
code |
a list indicating how data is to be encoded for each column. The list must have the same length as the number of columns of the data.frame to be encrypted. |
password |
a string used as a password for the HMAC. |
The original implementation uses the second and third position of the first name, the second, third and fifth position of the last name, and the full date of birth and sex as input for an HMAC. This would be akin to using code = list(c(2, 3), c(2, 3, 5), 0, 0). In this implementation, the positions of the subsets can be customized.
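The position subsetting can be sketched as follows (illustrative only; apply_code is a hypothetical helper, not part of the package):

```r
# 0 keeps the whole string; a vector of positions keeps those characters.
apply_code <- function(value, positions) {
  if (identical(positions, 0)) return(value)
  chars <- strsplit(value, "")[[1]]
  paste(chars[positions], collapse = "")
}

apply_code("miller", c(2, 3, 5))  # "ile"
apply_code("1980-05-12", 0)       # full string kept
```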
A data.frame containing IDs and the corresponding Encrypted Statistical Linkage Keys.
Karmel, R., Anderson, P., Gibson, D., Peut, A., Duckett, S., Wells, Y. (2010): Empirical aspects of record linkage across multiple data sets using statistical linkage keys: the experience of the PIAC cohort study. BMC Health Services Research 41(10).
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Encrypt data
res <- Create581(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                 code = list(0, 0, c(2, 3), c(2, 3, 5)),
                 password = "(H]$6Uh*-Z204q")
# Code: 0 means the whole string is used,
# c(2, 3) means the second and third letter of the string is used
Creates ALCs from clear-text data by creating soundex phonetics for first and last names and concatenating all other identifiers. The resulting code is encrypted using SHA-2. The user can decide on which columns the soundex phonetic is applied.
CreateALC(ID, data, soundex, password)
ID |
A character vector or integer vector containing the IDs of the data.frame. |
data |
a data.frame containing the data to be encoded. |
soundex |
a binary vector with one element for each input column, indicating whether Soundex is to be used: 1 = Soundex is used, 0 = Soundex is not used. The soundex vector must have the same length as the number of columns of the data.frame. |
password |
a string used as a password for the HMAC. |
A data.frame containing IDs and the corresponding Anonymous Linkage Codes.
Herzog, T. N., Scheuren, F. J., Winkler, W. E. (2007): Data Quality and Record Linkage Techniques. Springer.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Encrypt data, use Soundex for names
res <- CreateALC(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                 soundex = c(0, 0, 1, 1), password = "$6Uh*-Z204q")
This method generates a new bit vector out of an existing Bloom filter. Building and comparison are both possible with CompareAS16.
CreateAS16(ID, data, password)
ID |
A character vector or integer vector containing the IDs of the data.frame. |
data |
A character vector containing the original bit vectors created by any Bloom Filter-based method. |
password |
A string containing the password to be used for both "create" and "compare". |
A character vector containing bit vectors created as described in the original publication.
Armknecht, F., Schnell, R. (unpublished): Privacy Preserving Record Linkage Based on Bloom Filters and Codes. Working Paper.
CompareAS16, CreateBF, CreateCLK
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Create Bloom filters
testData <- CreateBF(ID = testData$V1, testData$V7,
                     k = 20, padding = 1, q = 2, l = 1000,
                     password = "(H]$6Uh*-Z204q")

# Create the new Bloom filters
testAS <- CreateAS16(testData$ID, testData$CLKs, password = "khäuds")
Creates CLKs with constant Hamming weights by appending a negated copy of the binary input vector, which is then permuted.
CreateBalancedBF(ID, data, password)
ID |
A character vector or integer vector containing the IDs of the data.frame. |
data |
Bit vectors as created by any Bloom filter-based method. |
password |
a string used as a password for the random permutation. |
A data.frame containing IDs and the corresponding Balanced Bloom Filter.
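The balancing step can be sketched as follows (a toy illustration with a fixed seed; the package derives the permutation from password):

```r
# Append the negated vector, then randomly permute: the result always has
# exactly length(bits) one-bits, i.e. a constant Hamming weight.
balance_bf <- function(bits, seed = 99) {
  set.seed(seed)                 # stand-in for the password-derived permutation
  doubled <- c(bits, 1L - bits)
  doubled[sample(length(doubled))]
}

b <- balance_bf(c(1L, 0L, 0L, 1L, 1L))
sum(b)  # 5: half of the 10 bits are one
```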
Berger, J. M. (1961): A Note on Error Detection Codes for Asymmetric Channels. In: Information and Control 4: 68–73.
Knuth, Donald E. (1986): Efficient Balanced Codes. In: IEEE Transactions on Information Theory IT-32 (1): 51–53.
Schnell, R., Borgs, C. (2016): Randomized Response and Balanced Bloom Filters for Privacy Preserving Record Linkage. IEEE International Conference on Data Mining (ICDM 2016), Barcelona.
CreateBF, CreateBitFlippingBF, CreateCLK, CreateDoubleBalancedBF, CreateEnsembleCLK, CreateMarkovCLK, CreateRecordLevelBF, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Create bit vectors, e.g. with CreateBF
testData <- CreateBF(ID = testData$V1, testData$V7,
                     k = 20, padding = 1, q = 2, l = 1000,
                     password = "(H]$6Uh*-Z204q")

# Create Balanced Bloom Filters
BB <- CreateBalancedBF(ID = testData$ID, data = testData$CLKs,
                       password = "hdayfkgh")
Creates Bloom filters for each row of the input data by splitting the input into q-grams which are hashed into a bit vector.
CreateBF(ID, data, password, k = 20, padding = 1, qgram = 2, lenBloom = 1000)
ID |
a character vector or integer vector containing the IDs of the data.frame. |
data |
a character vector containing the data to be encoded. Make sure the input vectors are not factors. |
password |
a string used as a password for the random hashing of the q-grams. |
k |
number of bit positions set to one for each bigram. |
padding |
integer (0 or 1) indicating if string padding is to be used. |
qgram |
integer (1 or 2) indicating whether to split the input strings into bigrams (q = 2) or unigrams (q = 1). |
lenBloom |
desired length of the Bloom filter in bits. |
A data.frame containing IDs and the corresponding bit vector.
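The encoding idea can be sketched as follows (a toy keyed hash via RNG seeding; the package's actual hashing differs):

```r
# Split a padded string into bigrams and set k positions per bigram
# in a bit vector of length l.
encode_bf <- function(s, k = 20, l = 1000, password = "secret") {
  s <- paste0("_", s, "_")                          # string padding
  grams <- substring(s, 1:(nchar(s) - 1), 2:nchar(s))
  bits <- integer(l)
  for (g in grams) {
    set.seed(sum(utf8ToInt(paste0(g, password))))   # toy keyed hash
    bits[sample.int(l, k)] <- 1L
  }
  bits
}

bf <- encode_bf("smith")
sum(bf) <= 20 * 6  # at most k ones per bigram ("_smith_" has 6 bigrams)
```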
Schnell, R., Bachteler, T., Reiher, J. (2009): Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making 9: 41.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Encode data
BF <- CreateBF(ID = testData$V1, data = testData$V7,
               k = 20, padding = 1, q = 2, l = 1000,
               password = "(H]$6Uh*-Z204q")
Applies Permanent Randomized Response to flip bits of the bit vectors given.
CreateBitFlippingBF(data, password, f)
data |
a data.frame containing the IDs in the first column and bit vectors created by any Bloom filter-based method in the second column. |
password |
a string to seed the random bit flipping. |
f |
a numeric between 0 and 1 giving the probability of flipping a bit. |
The randomized response technique is applied to each bit position B[i] of a Bloom filter B: B[i] is set to one or zero with a probability of f/2 for each outcome. The bit position remains unchanged with a probability of 1 - f, where 0 <= f <= 1.
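The flipping rule can be sketched as follows (illustrative; the package seeds its randomness from password):

```r
# Set a bit to one with probability f/2, to zero with probability f/2,
# and keep it unchanged with probability 1 - f.
flip_bits <- function(bits, f, seed = 42) {
  set.seed(seed)   # stand-in for the password-derived randomness
  u <- runif(length(bits))
  ifelse(u < f / 2, 1L, ifelse(u < f, 0L, bits))
}

flip_bits(c(1L, 0L, 1L, 1L, 0L), f = 0.1)
```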
A data.frame containing IDs and the corresponding bit vector.
Schnell, R., Borgs, C. (2016): Randomized Response and Balanced Bloom Filters for Privacy Preserving Record Linkage. IEEE International Conference on Data Mining (ICDM 2016), Barcelona.
CreateBF, CreateCLK, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

## Encode data into Bloom filters
BF <- CreateBF(ID = testData$V1, data = testData$V7,
               k = 20, padding = 1, q = 2, l = 1000,
               password = "(H]$6Uh*-Z204q")

# Create Permanent Randomized Response Bloom filters
RR <- CreateBitFlippingBF(BF, password = "l+kfdj1J", f = 0.1)
Each column of the input data.frame is hashed into a single additive Bloom filter.
CreateCLK(ID, data, password, k = 20, padding = as.integer(c(0)), qgram = as.integer(c(2)), lenBloom = 1000)
ID |
A character vector or integer vector containing the IDs of the data.frame. |
data |
a data.frame containing the data to be encoded. Make sure the input vectors are not factors. |
password |
a character vector with a password for each column of the input data.frame for the random hashing of the q-grams. |
k |
number of bit positions set to one for each bigram. |
padding |
integer vector (0 or 1) indicating if string padding is to be used on the columns of the input. The padding vector must have the same size as the number of columns of the input data. |
qgram |
integer vector (1 or 2) indicating whether to split the input strings into bigrams (q = 2) or unigrams (q = 1). The qgram vector must have the same size as the number of columns of the input data. |
lenBloom |
desired length of the final Bloom filter in bits. |
A data.frame containing IDs and the corresponding bit vector.
Schnell, R. (2014): An efficient Privacy-Preserving Record Linkage Technique for Administrative Data and Censuses. Journal of the International Association for Official Statistics (IAOS) 30: 263-270.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

## Encode data
CLK <- CreateCLK(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                 k = 20, padding = c(0, 0, 1, 1), q = c(1, 1, 2, 2),
                 l = 1000,
                 password = c("HUh4q", "lkjg", "klh", "Klk5"))
Double balanced Bloom filters are created by first creating balanced Bloom filters (see CreateBalancedBF), then negating the whole data set and shuffling each Bloom filter.
CreateDoubleBalancedBF(ID, data, password)
ID |
A character vector containing the ID. The ID vector must have the same size as the number of rows of data. |
data |
Bit vectors as created by any Bloom filter-based method. |
password |
A string used as a password for the random permutations. |
A data.frame containing IDs and the corresponding double balanced bit vector.
Schnell, R. (2017): Recent Developments in Bloom Filter-based Methods for Privacy-preserving Record Linkage. Curtin Institute for Computation, Curtin University, Perth, 12.9.2017.
CreateBalancedBF, CreateBF, CreateCLK, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Create bit vectors, e.g. with CreateBF
testData <- CreateBF(ID = testData$V1, testData$V7,
                     k = 20, padding = 1, q = 2, l = 1000,
                     password = "(H]$6Uh*-Z204q")

# Create Double Balanced Bloom Filters
DBB <- CreateDoubleBalancedBF(ID = testData$ID, data = testData$CLKs,
                              password = "hdayfkgh")
Creates multiple CLKs which are combined using a simple majority rule.
CreateEnsembleCLK(ID, data, password, NumberOfCLK = 1, k = 20, padding = as.integer(c(0)), qgram = as.integer(c(2)), lenBloom = 1000)
ID |
A character vector or integer vector containing the IDs of the data.frame. |
data |
a data.frame containing the data to be encoded. Make sure the input vectors are not factors. |
password |
a character vector with a password for each column of the input data.frame for the random hashing of the q-grams. |
NumberOfCLK |
number of independent CLKs to be built. |
k |
number of bit positions set to one for each bigram. |
padding |
integer vector (0 or 1) indicating if string padding is to be used on the columns of the input. The padding vector must have the same size as the number of columns of the input data. |
qgram |
integer vector (1 or 2) indicating whether to split the input strings into bigrams (q = 2) or unigrams (q = 1). The qgram vector must have the same size as the number of columns of the input data. |
lenBloom |
desired length of the final Bloom filter in bits. |
Creates a set number of independent CLKs for each record of the input data.frame and combines them using a simple majority rule: a bit position in the final CLK of length l is set to one if more than half of the independent CLKs have a one at that position. Otherwise the final bit position is zero.
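The majority rule can be sketched as follows (independent CLKs stored as rows of a 0/1 matrix):

```r
# A final bit is one iff more than half of the independent CLKs
# have a one at that position.
majority_clk <- function(clk_matrix) {
  as.integer(colSums(clk_matrix) > nrow(clk_matrix) / 2)
}

clks <- rbind(c(1, 0, 1, 1),
              c(1, 1, 0, 1),
              c(0, 0, 1, 1))
majority_clk(clks)  # 1 0 1 1
```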
A data.frame containing IDs and the corresponding ensemble bit vector.
Kuncheva, L. (2014): Combining Pattern Classifiers: Methods and Algorithms. Wiley.
CreateBF, CreateCLK, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

## Not run: 
# Create ensemble CLKs
EnsembleCLK <- CreateEnsembleCLK(ID = testData$V1,
                                 data = testData[, c(2, 3, 7, 8)],
                                 k = 20, padding = c(0, 0, 1, 1),
                                 q = c(1, 2, 2, 2), l = 1000,
                                 password = c("HUh4q", "lkjg", "klh", "Klk5"),
                                 NumberOfCLK = 5)
## End(Not run)
Builds CLKs encoding additional bigrams based on the transition probabilities as estimated by a Markov Chain.
CreateMarkovCLK(ID, data, password, markovTable, k1 = 20, k2 = 4, padding = as.integer(c(0)), qgram = as.integer(c(2)), lenBloom = 1000, includeOriginalBigram = TRUE, v = FALSE)
ID |
a character vector or integer vector containing the IDs of the data.frame. |
data |
a data.frame containing the data to be encoded. Make sure the input vectors are not factors. |
password |
a character vector with a password for each column of the input data.frame for the random hashing of the q-grams. |
markovTable |
a numeric matrix containing the transition probabilities for all bigrams possible. |
k1 |
number of bit positions set to one for each bigram. |
k2 |
number of additional bigrams drawn for each original bigram. |
padding |
integer vector (0 or 1) indicating if string padding is to be used on the columns of the input. The padding vector must have the same size as the number of columns of the input data. |
qgram |
integer vector (1 or 2) indicating whether to split the input strings into bigrams (q = 2) or unigrams (q = 1). The qgram vector must have the same size as the number of columns of the input data. |
lenBloom |
desired length of the final Bloom filter in bits. |
includeOriginalBigram |
by default, the original bigram is encoded together with the additional bigrams. Set this to FALSE to encode only the additional bigrams. |
v |
verbose. |
A transition matrix for all possible bigrams is built by fitting a Markov chain distribution with a Laplacian smoother. A transition matrix built for bigrams using the NC Voter Data is included in the package. For each original bigram in the data, k2 new bigrams are drawn according to their follow-up probability as given by the transition matrix. The final bigram set is then encoded following CreateCLK.
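The bigram augmentation step can be sketched as follows (toy transition matrix; draw_bigrams is a hypothetical helper, not the package's implementation):

```r
# Draw k2 additional bigrams from the transition-probability row
# of the original bigram.
draw_bigrams <- function(bigram, trans, k2, seed = 1) {
  set.seed(seed)   # stand-in for the password-derived randomness
  sample(colnames(trans), size = k2, replace = TRUE,
         prob = trans[bigram, ])
}

# Toy transition matrix over three bigrams
trans <- matrix(c(0.2, 0.5, 0.3,
                  0.1, 0.1, 0.8,
                  0.4, 0.4, 0.2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("an", "nn", "na"),
                                c("an", "nn", "na")))
draw_bigrams("an", trans, k2 = 4)
```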
A data.frame containing IDs and the corresponding bit vector.
Schnell R., Borgs C. (2017): Using Markov Chains for Hardening Bloom Filter Encryptions against Cryptographic Attacks in Privacy Preserving Record Linkage. German Record Linkage Center Working Paper.
Schnell, R. (2017): Recent Developments in Bloom Filter-based Methods for Privacy-preserving Record Linkage. Curtin Institute for Computation, Curtin University, Perth, 12.9.2017.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

## Not run: 
# Load example Markov chain matrix (created from NC Voter Data)
markovFile <- file.path(path.package("PPRL"), "extdata/TestMatrize.csv")
markovData <- read.csv(markovFile, sep = " ", header = TRUE,
                       check.names = FALSE)
markovData <- as.matrix(markovData)

# Create Markov CLKs
CLKMarkov <- CreateMarkovCLK(ID = testData$V1,
                             data = testData[, c(2, 3, 7, 8)],
                             markovTable = markovData,
                             k1 = 20, k2 = 4, l = 1000,
                             padding = c(0, 0, 1, 1), q = c(1, 2, 2, 2),
                             password = c("(H]$6Uh*-Z204q", "lkjg", "klh",
                                          "KJHklk5"),
                             includeOriginalBigram = TRUE)
## End(Not run)
Creates Record Level Bloom filters, combining single Bloom filters into a single bit vector.
CreateRecordLevelBF(ID, data, password, lenRLBF = 1000, k = 20, padding = as.integer(c(0)), qgram = as.integer(c(2)), lenBloom = as.integer(c(500)), method = "StaticUniform", weigths = as.numeric(c(1)))
ID |
a character vector or integer vector containing the IDs of the data.frame. |
data |
a character vector containing the data to be encoded. Make sure the input vectors are not factors. |
password |
a string used as a password for the random hashing of the q-grams and the shuffling. |
lenRLBF |
length of the final Bloom filter. |
lenBloom |
an integer vector containing the length of the first level Bloom filters which are to be combined. For the methods "StaticUniform" and "StaticWeighted", a single integer is required, since all original Bloom filters will have the same length. |
k |
number of bit positions set to one for each q-gram. |
padding |
integer (0 or 1) indicating if string padding is to be used. |
qgram |
integer (1 or 2) indicating whether to split the input strings into bigrams (q = 2) or unigrams (q = 1). |
method |
one of "StaticUniform", "StaticWeigthed", "DynamicUniform" or "DynamicWeighted" (see details). |
weigths |
weights used for the "StaticWeighted" and "DynamicWeighted" methods. The weights vector must have the same length as the number of columns in the input data. The sum of the weights must be 1. |
Single Bloom filters are first built for every variable in the input data.frame. Combining the Bloom filters is done by sampling a set fraction of the original Bloom filters and concatenating the samples. The result is then shuffled. The sampling can be done using four different weighting methods:
StaticUniform
StaticWeighted
DynamicUniform
DynamicWeighted
Details are described in the original publication.
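The weighted-sampling idea behind the "StaticWeighted" method can be sketched as follows (combine_weighted is a hypothetical helper, not the package's implementation):

```r
# Sample a weighted share of bit positions from each source Bloom filter,
# concatenate the samples and shuffle the result.
combine_weighted <- function(filters, len_out, weights, seed = 7) {
  set.seed(seed)   # stand-in for the password-derived randomness
  parts <- mapply(function(bits, w) {
    bits[sample(length(bits), size = round(w * len_out), replace = TRUE)]
  }, filters, weights, SIMPLIFY = FALSE)
  out <- unlist(parts)
  out[sample(length(out))]   # final shuffle
}

filters <- list(rbinom(500, 1, 0.5), rbinom(500, 1, 0.5))
rlbf <- combine_weighted(filters, len_out = 1000, weights = c(0.4, 0.6))
length(rlbf)  # 1000
```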
A data.frame containing IDs and the corresponding bit vector.
Durham, E. A. (2012). A framework for accurate, efficient private record linkage. Dissertation. Vanderbilt University.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
## Not run: 
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# StaticUniform
RLBF <- CreateRecordLevelBF(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                            lenRLBF = 1000, k = 20,
                            padding = c(0, 0, 1, 1), qgram = c(1, 1, 2, 2),
                            lenBloom = 500,
                            password = c("(H]$6Uh*-Z204q", "lkjg", "klh",
                                         "KJHkälk5"),
                            method = "StaticUniform")

# StaticWeigthed
RLBF <- CreateRecordLevelBF(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                            lenRLBF = 1000, k = 20,
                            padding = c(0, 0, 1, 1), qgram = c(1, 1, 2, 2),
                            lenBloom = 500,
                            password = c("(H]$6Uh*-Z204q", "lkjg", "klh",
                                         "KJHkälk5"),
                            method = "StaticWeigthed",
                            weigths = c(0.1, 0.1, 0.5, 0.3))

# DynamicUniform
RLBF <- CreateRecordLevelBF(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                            lenRLBF = 1000, k = 20,
                            padding = c(0, 0, 1, 1), qgram = c(1, 1, 2, 2),
                            lenBloom = c(300, 400, 550, 500),
                            password = c("(H]$6Uh*-Z204q", "lkjg", "klh",
                                         "KJHkälk5"),
                            method = "DynamicUniform")

# DynamicWeigthed
RLBF <- CreateRecordLevelBF(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
                            lenRLBF = 1000, k = 20,
                            padding = c(0, 0, 1, 1), qgram = c(1, 1, 2, 2),
                            lenBloom = c(300, 400, 550, 500),
                            password = c("(H]$6Uh*-Z204q", "lkjg", "klh",
                                         "KJHkälk5"),
                            method = "DynamicWeigthed",
                            weigths = c(0.1, 0.1, 0.5, 0.3))
## End(Not run)
Deterministic record linkage of two data sets, producing per-variable results that enable rule-based classification.
DeterministicLinkage(IDA, dataA, IDB, dataB, blocking = NULL, similarity)
IDA |
A character vector or integer vector containing the IDs of the first data.frame. |
dataA |
A data.frame containing the data to be linked and all linking variables as specified in similarity. |
IDB |
A character vector or integer vector containing the IDs of the second data.frame. |
dataB |
A data.frame containing the data to be linked and all linking variables as specified in similarity. |
blocking |
Optional blocking variables. See SelectBlockingFunction. |
similarity |
Variables used for linking and their respective linkage methods as specified in SelectSimilarityFunction. |
To call the deterministic linkage function, it is necessary to set up linking variables and methods. Using blocking variables is optional. Further options are available in SelectBlockingFunction and SelectSimilarityFunction.
A data.frame containing ID-pairs and the link status for each linking variable. This way, rules can be put into place allowing the user to classify links and non-links.
Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
Schnell, R., Bachteler, T., Reiher, J. (2004): A toolbox for record linkage. Austrian Journal of Statistics 33(1-2): 125-133.
ProbabilisticLinkage, SelectBlockingFunction, SelectSimilarityFunction, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t",
                     colClasses = "character")

# Define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")

# Select first name and last name as linking variables, to be linked
# using the Soundex phonetic (first name) and exact matching (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "Soundex")
l2 <- SelectSimilarityFunction("V8", "V8", method = "exact")

# Link the data as specified in bl and l1/l2
# (in this small example, the data is linked to itself)
res <- DeterministicLinkage(testData$V1, testData, testData$V1, testData,
                            similarity = c(l1, l2), blocking = bl)
Unordered Pairing Function creating a new unique integer from two input integers.
ElegantPairingInt(int1, int2)
int1 |
first integer to be paired. |
int2 |
second integer to be paired. |
With two non-negative integers x and y as input, the pairing is computed as pair(x, y) = x * y + floor((|x - y| - 1)^2 / 4).
The function is commutative, so the order of x and y does not matter. x and y have to be non-negative integers.
The function outputs a single non-negative integer that is uniquely associated with that unordered pair.
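As an illustration, the pairing can be sketched in plain R. This is a hypothetical reimplementation for clarity only, assuming the unordered pairing x * y + floor((|x - y| - 1)^2 / 4); it is not the package's internal implementation.

```r
# Unordered (commutative) pairing of two non-negative integers:
# pair(x, y) = x*y + floor((|x - y| - 1)^2 / 4)
elegant_pair <- function(x, y) {
  stopifnot(x >= 0, y >= 0)
  x * y + floor((abs(x - y) - 1)^2 / 4)
}

elegant_pair(2, 3)  # identical to elegant_pair(3, 2)
```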
Szudzik, M. (2006): An Elegant Pairing Function. Wolfram Science Conference NKS 2006.
ElegantPairingInt(2, 3)
Unordered pairing function creating a new unique integer from two input integers stored in a data.frame.
ElegantPairingVec(ID, data)
ID |
A character vector or integer vector containing the IDs of the data.frame. |
data |
A data.frame with two integer columns containing the values to be paired. |
With two non-negative integers x and y as input, the pairing is computed as pair(x, y) = x * y + floor((|x - y| - 1)^2 / 4).
The function is commutative, so the order of x and y does not matter. x and y have to be non-negative integers. The function outputs a single non-negative integer that is uniquely associated with that unordered pair.
A data.frame containing IDs and the computed integer.
Szudzik, M. (2006): An Elegant Pairing Function. Wolfram Science Conference NKS 2006.
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")
# Create a numeric data frame of day and month of birth
dataInt <- data.frame(as.integer(testData$V4), as.integer(testData$V5))
# Use unordered pairing on day and month
res <- ElegantPairingVec(testData$V1, dataInt)
Probabilistic Record Linkage of two data sets using distance-based or probabilistic methods.
ProbabilisticLinkage(IDA, dataA, IDB, dataB, blocking = NULL, similarity)
IDA |
A character vector or integer vector containing the IDs of the first data.frame. |
dataA |
A data.frame containing the data to be linked and all linking variables as specified in similarity. |
IDB |
A character vector or integer vector containing the IDs of the second data.frame. |
dataB |
A data.frame containing the data to be linked and all linking variables as specified in similarity. |
blocking |
Optional blocking variables. See SelectBlockingFunction for details. |
similarity |
Variables used for linking and their respective linkage methods as specified in SelectSimilarityFunction. |
To call the probabilistic linkage function, it is necessary to set up linking variables and methods; using blocking variables is optional. Further options are available in SelectBlockingFunction and SelectSimilarityFunction. This function implements the Fellegi-Sunter model, with the EM algorithm estimating the weights (Winkler 1988).
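For intuition, the Fellegi-Sunter agreement and disagreement weights derived from given m- and u-probabilities can be sketched as follows. This is illustrative only; in the package, m and u are estimated internally by the EM algorithm, and fs_weights is a hypothetical helper, not a package function.

```r
# Fellegi-Sunter log2 match weights for a single linking variable:
# agreement weight    = log2(m / u)
# disagreement weight = log2((1 - m) / (1 - u))
fs_weights <- function(m, u) {
  c(agreement    = log2(m / u),
    disagreement = log2((1 - m) / (1 - u)))
}

fs_weights(m = 0.9, u = 0.1)  # agreement log2(9), disagreement log2(1/9)
```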
A data.frame containing pairs of IDs, their corresponding similarity value and the match status as determined by the linkage procedure.
Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
Schnell, R., Bachteler, T., Reiher, J. (2004): A toolbox for record linkage. Austrian Journal of Statistics 33(1-2): 125-133.
Winkler, W. E. (1988): Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association: 667-671.
CreateBF, CreateCLK, DeterministicLinkage, SelectBlockingFunction, SelectSimilarityFunction, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")
# Define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")
# Select first name and last name as linking variables,
# to be linked using the Jaro-Winkler similarity measure (first name)
# and the Levenshtein distance (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw")
l2 <- SelectSimilarityFunction("V8", "V8", method = "lv")
# Link the data as specified in bl and l1/l2
# (in this small example the data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData, testData$V1, testData,
  similarity = c(l1, l2), blocking = bl)
Before calling ProbabilisticLinkage or DeterministicLinkage, a blocking method can be selected. The function call has to be repeated for each desired blocking variable.
SelectBlockingFunction(variable1, variable2, method)
variable1 |
Column name of blocking variable 1. |
variable2 |
Column name of blocking variable 2. |
method |
Desired blocking method. Possible values are 'exact' and 'exactCL'. |
The following methods are available for blocking:
'exact'
Simple exact blocking. All records with the same values for the blocking variable create a block. Searching for links is only done within these blocks.
'exactCL'
The same as 'exact'. Only works with strings; all characters are capitalized before blocking.
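Conceptually, exact blocking partitions the records by the value of the blocking variable, and candidate pairs are formed only within each partition. A minimal base-R illustration of this idea (not the package's internal implementation):

```r
# Toy data: four records with year of birth as blocking variable
df <- data.frame(id = 1:4, yob = c("1980", "1975", "1980", "1975"))

# Exact blocking: records sharing the same yob form one block
blocks <- split(df$id, df$yob)

# Links are only searched within a block, e.g. pairs (1, 3) and (2, 4)
blocks
```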
Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
DeterministicLinkage, ProbabilisticLinkage, SelectSimilarityFunction, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")
# Define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")
# Select first name and last name as linking variables,
# to be linked using the Jaro-Winkler similarity measure (first name)
# and the Levenshtein distance (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw")
l2 <- SelectSimilarityFunction("V8", "V8", method = "lv")
# Link the data as specified in bl and l1/l2
# (in this small example the data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData, testData$V1, testData,
  similarity = c(l1, l2), blocking = bl)
To call DeterministicLinkage or ProbabilisticLinkage, it is mandatory to select a similarity function for each variable. Each element of the setup contains the two variable names and the method. For some methods, further information can be entered.
SelectSimilarityFunction(variable1, variable2, method = "jw", ind_c0 = FALSE,
  ind_c1 = FALSE, m = 0.9, u = 0.1, p = 0.05, epsilon = 0.0004, lower = 0.0,
  upper = 0.0, threshold = 0.85, jaroWeightFactor = 1.0, lenNgram = 2)
variable1 |
name of linking variable 1 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector. |
variable2 |
name of linking variable 2 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector. |
method |
linking method. Possible values are:
|
ind_c0 |
Only used for jw2. Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers. A nonzero value indicates the option is deactivated. |
ind_c1 |
Only used for jw2. All lower case characters are converted to upper case prior to the comparison. Disabling this feature means that the lower case string "code" will not be recognized as the same as the upper case string "CODE". Also, the adjustment for similar characters section only applies to uppercase characters. A nonzero value indicates the option is deactivated. |
m |
Initial m value for the EM algorithm. Only used when linking using ProbabilisticLinkage. |
u |
Initial u value for the EM algorithm. Only used when linking using ProbabilisticLinkage. |
p |
Initial p value for the EM algorithm. Only used when linking using ProbabilisticLinkage. |
epsilon |
epsilon is a stopping criterion for the EM algorithm: the algorithm terminates when the relative change of the log-likelihood is less than epsilon. Only used when linking using ProbabilisticLinkage. |
lower |
Pairs with a similarity lower than 'lower' are classified as non-matches. Everything between 'lower' and 'upper' is classified as a possible match. Only used when linking using ProbabilisticLinkage. |
upper |
Pairs with a similarity higher than 'upper' are classified as matches. Everything between 'lower' and 'upper' is classified as a possible match. Only used when linking using ProbabilisticLinkage. |
threshold |
If using string similarities: Outputs only matches above the similarity threshold value. If using string distances: Outputs only matches below the set threshold distance. |
jaroWeightFactor |
The Jaro weight adjustment scales the matching weight according to the degree of similarity between the variable values. This factor determines the Jaro-adjusted matching weight. Only used when linking using ProbabilisticLinkage. |
lenNgram |
Length of ngrams. Only used for the method ngram. Length of ngrams must be between 1 and 4. |
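For reference, the overlapping q-grams compared by the ngram method can be illustrated with a small helper (a hypothetical sketch, not a package function):

```r
# Split a string into overlapping n-grams of length n
ngrams <- function(s, n = 2) {
  substring(s, 1:(nchar(s) - n + 1), n:nchar(s))
}

ngrams("peter", n = 2)  # "pe" "et" "te" "er"
```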
Calling the function will not return anything.
Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
Schnell, R., Bachteler, T., Reiher, J. (2004): A toolbox for record linkage. Austrian Journal of Statistics 33(1-2): 125-133.
Winkler, W. E. (1988): Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association: 667-671.
DeterministicLinkage, ProbabilisticLinkage, SelectBlockingFunction, StandardizeString
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")
# Define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")
# Select first name and last name as linking variables,
# to be linked using the Jaro-Winkler similarity (first name)
# and exact matching (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw",
  ind_c0 = FALSE, ind_c1 = FALSE, m = 0.9, u = 0.1, lower = 0.0, upper = 0.0)
l2 <- SelectSimilarityFunction("V8", "V8", method = "exact")
# Link the data as specified in bl and l1/l2
# (in this small example the data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData, testData$V1, testData,
  similarity = c(l1, l2), blocking = bl)
Preprocessing (cleaning) of strings prior to linkage.
StandardizeString(strings)
strings |
A character vector of strings to be standardized. |
Strings are capitalized and special letters are substituted as described below. Leading and trailing blanks are removed; other non-ASCII characters are deleted.
Replace "Æ" with "AE"
Replace "æ" with "AE"
Replace "Ä" with "AE"
Replace "ä" with "AE"
Replace "Å" with "A"
Replace "å" with "A"
Replace "Â" with "A"
Replace "â" with "A"
Replace "À" with "A"
Replace "à" with "A"
Replace "Á" with "A"
Replace "á" with "A"
Replace "Ç" with "C"
Replace "ç" with "C"
Replace "Ê" with "E"
Replace "ê" with "E"
Replace "È" with "E"
Replace "è" with "E"
Replace "É" with "E"
Replace "é" with "E"
Replace "Ï" with "I"
Replace "ï" with "I"
Replace "Î" with "I"
Replace "î" with "I"
Replace "Ì" with "I"
Replace "ì" with "I"
Replace "Í" with "I"
Replace "í" with "I"
Replace "Ö" with "OE"
Replace "ö" with "OE"
Replace "Ø" with "O"
Replace "ø" with "O"
Replace "Ô" with "O"
Replace "ô" with "O"
Replace "Ò" with "O"
Replace "ò" with "O"
Replace "Ó" with "O"
Replace "ó" with "O"
Replace "ß" with "SS"
Replace "Ş" with "S"
Replace "ş" with "S"
Replace "ü" with "UE"
Replace "Ü" with "UE"
Replace "Ů" with "U"
Replace "Û" with "U"
Replace "û" with "U"
Replace "Ù" with "U"
Replace "ù" with "U"
Returns a character vector with standardized strings.
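The substitutions above can be sketched in base R for a handful of characters. This is a partial illustration only, not the package's implementation, and standardize_sketch is a hypothetical name:

```r
# Partial sketch: trim blanks, substitute a few special letters, capitalize
standardize_sketch <- function(x) {
  x <- trimws(x)
  subs <- c("\u00e4" = "AE", "\u00c4" = "AE",  # ä, Ä
            "\u00f6" = "OE", "\u00d6" = "OE",  # ö, Ö
            "\u00fc" = "UE", "\u00dc" = "UE",  # ü, Ü
            "\u00df" = "SS")                   # ß
  for (from in names(subs)) x <- gsub(from, subs[[from]], x, fixed = TRUE)
  toupper(x)
}

standardize_sketch(c("P\u00e4ter", " J\u00fcrgen", " Ro\u00df"))
```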
strings <- c("Päter", " Jürgen", " Roß")
StandardizeString(strings)
Applies Wolfram's cellular automaton rule 30 to the input bit vectors.
WolframRule30(ID, data, lenBloom, t)
ID |
IDs as character vector. |
data |
character vector containing bit vectors. |
lenBloom |
length of Bloom filters. |
t |
indicates how often rule 30 is to be used. |
Returns a character vector with new bit vectors after rule 30 has been applied t times.
https://en.wikipedia.org/wiki/Rule_30
Schnell, R. (2017): Recent Developments in Bloom Filter-based Methods for Privacy-preserving Record Linkage. Curtin Institute for Computation, Curtin University, Perth, 12.9.2017.
Wolfram, S. (1983): Statistical mechanics of cellular automata. Rev. Mod. Phys. 55 (3): 601–644.
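One step of rule 30 maps each cell to left XOR (center OR right). A minimal base-R sketch on a 0/1 integer vector, assuming zero-padded boundaries (the package's boundary handling may differ):

```r
# Apply one step of Wolfram's rule 30 to a 0/1 integer vector
rule30_step <- function(bits) {
  left  <- c(0L, head(bits, -1))   # left neighbours, zero-padded
  right <- c(bits[-1], 0L)         # right neighbours, zero-padded
  # Rule 30: new cell = left XOR (center OR right)
  as.integer(xor(left == 1L, (bits == 1L) | (right == 1L)))
}

rule30_step(c(0L, 0L, 1L, 0L, 0L))  # 0 1 1 1 0
```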
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")
# Create bit vectors, e.g. by CreateCLK or CreateBF
CLK <- CreateCLK(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
  k = 20, padding = c(0, 0, 1, 1), q = c(1, 1, 2, 2), l = 1000,
  password = c("HUh4q", "lkjg", "klh", "Klk5"))
# Apply rule 30 once
res <- WolframRule30(CLK$ID, CLK$CLK, lenBloom = 1000, t = 1)
Applies Wolfram's cellular automaton rule 90 to the input bit vectors.
WolframRule90(ID, data, lenBloom, t)
ID |
IDs as character vector. |
data |
character vector containing bit vectors. |
lenBloom |
length of Bloom filters. |
t |
indicates how often rule 90 is to be used. |
Returns a character vector with new bit vectors after rule 90 has been applied t times.
https://en.wikipedia.org/wiki/Rule_90
Martin, O., Odlyzko, A. M., Wolfram, S. (1984): Algebraic properties of cellular automata. Communications in Mathematical Physics, 93 (2): 219-258.
Schnell, R. (2017): Recent Developments in Bloom Filter-based Methods for Privacy-preserving Record Linkage. Curtin Institute for Computation, Curtin University, Perth, 12.9.2017.
Wolfram, S. (1983): Statistical mechanics of cellular automata. Rev. Mod. Phys. 55 (3): 601–644.
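Rule 90 maps each cell to the XOR of its two neighbours. A minimal base-R sketch, again assuming zero-padded boundaries (the package's boundary handling may differ):

```r
# Apply one step of Wolfram's rule 90 to a 0/1 integer vector
rule90_step <- function(bits) {
  left  <- c(0L, head(bits, -1))   # left neighbours, zero-padded
  right <- c(bits[-1], 0L)         # right neighbours, zero-padded
  # Rule 90: new cell = left XOR right
  as.integer(xor(left == 1L, right == 1L))
}

rule90_step(c(0L, 0L, 1L, 0L, 0L))  # 0 1 0 1 0
```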
# Load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")
# Create bit vectors, e.g. by CreateCLK or CreateBF
CLK <- CreateCLK(ID = testData$V1, data = testData[, c(2, 3, 7, 8)],
  k = 20, padding = c(0, 0, 1, 1), q = c(1, 1, 2, 2), l = 1000,
  password = c("HUh4q", "lkjg", "klh", "Klk5"))
# Apply rule 90 once
res <- WolframRule90(CLK$ID, CLK$CLK, lenBloom = 1000, t = 1)