![Page 1: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/1.jpg)
Statistical Confidentiality: Is Synthetic Data the Answer?
George Duncan
2006 February 13
![Page 2: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/2.jpg)
Acknowledging Colleagues
- Diane Lambert, Google
- Stephen Fienberg, Carnegie Mellon
- Stephen Roehrig, Carnegie Mellon
- Lynne Stokes, Southern Methodist
- Sallie Keller-McNulty, Rice
- Mark Elliot, Manchester, UK
- JJ Salazar, Universidad de La Laguna, Spain
![Page 3: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/3.jpg)
Acknowledging Current Funding
- NSF, NISS Digital Government II, Data Confidentiality, Data Quality and Data Integration for Federal Databases: Foundations to Software Prototypes
- Agency Partners:
  - Bureau of Labor Statistics
  - Bureau of Transportation Statistics
  - Census Bureau
  - National Agricultural Statistics Service
  - National Center for Education Statistics
![Page 4: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/4.jpg)
Questions Addressed
- What’s the R-U confidentiality map?
- What are synthetic data?
- Can the research community benefit from synthetic data?
- Source data: the Gold Standard?
- How should we evaluate a synthesizer?
![Page 5: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/5.jpg)
Brokering Role of the Information Organization
[Diagram: the information organization sits between DATA CAPTURE (from Respondents) and DISSEMINATION (to Policy Analyst, Decision Maker, Media, Researcher, Data Snooper)]
![Page 6: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/6.jpg)
Why Confidentiality Matters
- Ethical: Keeping promises; basic value tied to privacy concerns of solitude, autonomy and individuality
- Pragmatic: Without confidentiality, respondent may not provide data; worse, may provide inaccurate data
- Legal: Required under law
![Page 7: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/7.jpg)
Confidentiality Audit
- Sensitive objects
  - Attribute values
  - Relationships
- Susceptible data
  - Geographical detail
  - Longitudinal or panel structure
  - Outliers
  - Many attribute variables
  - Detailed attribute variables
  - Census versus survey/sample
  - Existence of linkable external databases
![Page 8: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/8.jpg)
Making It Safe
- Restricted Data
- Restricted Access
![Page 9: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/9.jpg)
RESTRICTED ACCESS
- Special Sworn Employee
  - Census Bureau
- Licensed Researchers
  - National Center for Education Statistics
- External Sites
  - California Census Research Data Center
![Page 10: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/10.jpg)
On Line Access
![Page 11: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/11.jpg)
On Line Access
[Diagram labels: Restricted Access, Restricted Data]
![Page 12: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/12.jpg)
![Page 13: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/13.jpg)
Matrix Masking
Transforming the source data ($X$) to the disseminated data ($Y$)
- Suppressions
- Perturbations
- Samplings
- Aggregations

$$Y = AXB + C$$
![Page 14: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/14.jpg)
Matrix Masking
Transforming the original data ($X$) to the disseminated data ($Y$)
- Suppressions
- Perturbations
- Samplings
- Aggregations

$$Y = AXB + C$$

$X$ is the $n \times p$ source data matrix, with $n$ records and $p$ attributes
![Page 15: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/15.jpg)
Matrix Masking
Transforming the original data ($X$) to the disseminated data ($Y$)
- Suppressions
- Perturbations
- Samplings
- Aggregations

$$Y = AXB + C$$

- $A$: row operator, so record transformation
- $B$: column operator, so attribute transformation
- $C$: additive perturbation
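A minimal sketch of how the masking equation $Y = AXB + C$ can express sampling, suppression, perturbation and aggregation. The data values, matrix names and noise scales below are made up for illustration and do not come from the talk.

```python
import numpy as np

def matrix_mask(X, A, B, C):
    """Return the masked release Y = A X B + C."""
    return A @ X @ B + C

rng = np.random.default_rng(0)

# Hypothetical source data: n = 4 records, p = 3 attributes (age, income, flag).
X = np.array([
    [30.0,  52000.0, 1.0],
    [45.0,  61000.0, 0.0],
    [51.0,  58000.0, 1.0],
    [38.0, 250000.0, 0.0],   # an income outlier
])

# Sampling / record suppression: A keeps records 0, 1, 2 and drops record 3.
A_sample = np.eye(4)[[0, 1, 2], :]
# Attribute suppression: B keeps the first two columns and drops the third.
B_suppress = np.eye(3)[:, [0, 1]]
# Additive perturbation: C is noise with a different scale per released column.
C_noise = rng.normal(0.0, [1.0, 500.0], size=(3, 2))
Y = matrix_mask(X, A_sample, B_suppress, C_noise)

# Aggregation: each row of A averages a pair of records; no column change, no noise.
A_agg = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5]])
Y_agg = matrix_mask(X, A_agg, np.eye(3), np.zeros((2, 3)))
```

Here `A` acts on records, `B` on attributes, and `C` adds noise, matching the three roles named on the slide.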
![Page 16: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/16.jpg)
![Page 17: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/17.jpg)
Use $X$ to estimate $F_X$
Generate samples from $\hat{F}_X$
![Page 18: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/18.jpg)
Origins of the Synthetic Data Idea
- Computer Science:
  - Liew, C. K., Choi, U. J., and Liew, C. J. (1985) A data distortion by probability distribution, ACM Transactions on Database Systems 10 395-411
- Statistics:
  - Rubin, D. B. (1993), Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata, Journal of Official Statistics 9 461-468
![Page 19: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/19.jpg)
Further Developments
- Fienberg, S. E., Makov, U. E. and Steele, R. J. (1998) Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14 347-360
- Kennickell, Arthur B. (1999) Multiple imputation and disclosure protection. Statistical Data Protection ’98 Lisbon 381-400
- Now attention of other authors, particularly Little, Raghunathan, Reiter, Rubin, Abowd, Woodcock
- My latest bibliography on SD has 31 entries
![Page 20: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/20.jpg)
What was the original purpose?
Public-use microdata file to allow user to make valid inferences about population parameters using straightforward statistical tools while protecting confidentiality (Rubin 1993)
![Page 21: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/21.jpg)
One Person’s Assessment
“… synthetic data sets which have all of the statistical properties of the original data set, but have entirely false data - made-up data, so that you cannot break confidentiality because, in fact, any data set, any data record you have is a synthetic data record. …
… possibly the way of the future for lots of very, very confidential data, and maybe because the … the ability to protect confidentiality … is being eroded by the internet … this is probably where we are going to be driven to, although, I hope not.”
-- Norman Bradburn (2003)
![Page 22: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/22.jpg)
Use $X$ to estimate $F_X$
Generate samples from $\hat{F}_X$
How should we get the synthesizer?
![Page 23: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/23.jpg)
Less-Ambitious Data-Use Purposes
- “Gain familiarity with the dataset structure, develop code, and estimate analytical models; compare against ‘gold standard file’” (Abowd and Lane 2003, Abowd 2005)
- “… people can send in their sort of model. They can make up the synthetic data. You can go back, you can run things, sharpen up your hypotheses and so forth, and then after you’ve got everything and get your codes all right and get your SAS Codes right, and then send it in and they will run the data - the real data, and they’ll send you back the results.” (Bradburn 2003)
![Page 24: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/24.jpg)
R-U Confidentiality Map
[Figure: plot with axes Disclosure Risk R and Data Utility U; labeled points for No Data, Released Data and Original Data, and a threshold marked Maximum Tolerable Risk]
![Page 25: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/25.jpg)
Disclosure Limitation Parameters
- Specify extent of disclosure limitation
- Disclosure risk and data utility vary with these parameter values
  - Top-coding limit
  - Standard deviation of additive noise
- Interpretation for synthetic data
  - Extent released data are synthetic: partial synthetic data (Little, 1993)
  - Extent synthetic data matches source data (e.g., outliers)
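The two example parameters above, a top-coding limit and the standard deviation of additive noise, can be sketched as follows. The income data are simulated, and `share_close` is only a crude, hypothetical stand-in for disclosure risk (the share of released values within 1% of the true value); the talk does not define risk or utility this way.

```python
import numpy as np

def top_code(values, limit):
    """Replace every value above `limit` with `limit`."""
    return np.minimum(values, limit)

def add_noise(values, sd, rng):
    """Add independent zero-mean normal noise with standard deviation `sd`."""
    return values + rng.normal(0.0, sd, size=values.shape)

def share_close(source, released, tol=0.01):
    """Crude risk proxy: share of released values within `tol` (relative) of the source value."""
    return np.mean(np.abs(released - source) <= tol * np.abs(source))

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1000)  # made-up source attribute

# Sweep the noise parameter: a larger sd lowers the risk proxy
# and raises the error in the released mean (a crude utility-loss proxy).
for sd in (100.0, 1000.0, 10000.0):
    released = add_noise(income, sd, rng)
    risk = share_close(income, released)
    mean_error = abs(released.mean() - income.mean())
    print(f"sd={sd:>8.0f}  risk proxy={risk:.3f}  |mean error|={mean_error:.1f}")

# Top-coding: a lower limit hides more of the upper tail.
capped = top_code(income, limit=np.quantile(income, 0.95))
```

Each parameter value gives one point on the R-U confidentiality map; sweeping the parameter traces out a curve.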
![Page 26: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/26.jpg)
Does Synthetic Data Guarantee Confidentiality?
- Synthetic data record not respondent’s actual data record, so identity disclosure is impossible
- Attribute disclosure can happen
- Particularly with extreme values, it may be possible to re-identify a source record
![Page 27: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/27.jpg)
Does Synthetic Data Guarantee Confidentiality?
- If simulated individuals have data values virtually identical to source individuals, possibility of both identity and attribute disclosure (Fienberg 1997, 2003)
- If quasi-identifier attributes are synthesized, re-identification can happen if data snooper can link an external identified data source using the quasi-identifier attributes (Domingo-Ferrer et al 2005)
![Page 28: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/28.jpg)
Does Synthetic Data Guarantee Confidentiality?
- Because a synthetic data record is not any respondent’s actual data record, identity disclosure is directly impossible
- Attribute disclosure is still possible
- But, particularly with extreme values, it may still be possible to re-identify a source record
- Some simulated individuals may have data values virtually identical to original sample individuals, so the possibility of both identity and attribute disclosure remains (Fienberg 1997, 2003)

Not fully, but it can appreciably lower disclosure risk
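One way to look for simulated individuals whose values are virtually identical to source individuals is to measure, for each synthetic record, the distance to its closest source record. This is an illustrative check, not a method from the talk; the function name, the standardization and the toy data are assumptions.

```python
import numpy as np

def nearest_source_distance(synthetic, source):
    """For each synthetic record, the Euclidean distance to the closest source
    record, with every attribute scaled by its source standard deviation."""
    scale = source.std(axis=0, ddof=1)
    diff = (synthetic[:, None, :] - source[None, :, :]) / scale
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

# Toy source data with one extreme record (the last row).
source = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [50.0, 500.0]])

# Two hypothetical synthetic records: the second nearly reproduces the outlier.
synthetic = np.array([[2.5, 24.0],
                      [50.2, 498.0]])

distances = nearest_source_distance(synthetic, source)
print(distances)
```

Synthetic records with very small distances, especially near extreme source values, are the ones that deserve a closer look before release.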
![Page 29: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/29.jpg)
Are Synthetic Data Valid?
- Not unless we are careful in how it is synthesized
- Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)
![Page 30: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/30.jpg)
Are Synthetic Data Valid?
- Not unless we are careful in how it is synthesized
- Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)

If we do it right
![Page 31: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/31.jpg)
Synthesizer Build
- Synthesizer build involves constructing a statistical model
- But… model purpose not the usual
  - Not prediction, control or scientific understanding
- Usual model construction exploits Occam’s Razor and seeks parsimony
![Page 32: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/32.jpg)
Careful with Occam’s Razor
- "Everything should be made as simple as possible, but not one bit simpler." -- Albert Einstein
- "Seek simplicity, and distrust it." -- Alfred North Whitehead
![Page 33: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/33.jpg)
Source Data not 24 Karat Gold Standard?
- Steve Fienberg has noted
  - Sampled population often not target population
  - Coding errors, imputed missing data
- Do we really want to duplicate the statistical results obtainable from the source data? (Match source data)
- Or, do we want to obtain statistical inferences equally valid as those from the source data? (Match source data goal)
![Page 34: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/34.jpg)
What posterior predictive distribution for synthetic data?
- “In actual implementations, the correct posterior predictive distribution is not known, and an imputer-constructed approximation is used.” Jerry Reiter (2002)
- What sampling distributions?
- What priors work best?
- What if the data analyst uses a prior very different from the synthesizer?
![Page 35: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/35.jpg)
[Figure: Scatterplot of Y vs X; X axis from 60 to 130, Y axis from 200 to 400]
![Page 36: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/36.jpg)
Regression Analysis: Y versus X, X-squared

The regression equation is
Y = 6.61 + 3.05 X + 0.00062 X-squared

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 6.605 | 9.829 | 0.67 | 0.507 |
| X | 3.0516 | 0.2044 | 14.93 | 0.000 |
| X-squared | 0.000621 | 0.001046 | 0.59 | 0.558 |

S = 1.62190, R-Sq = 99.9%
![Page 37: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/37.jpg)
The regression equation is
Y = 0.88 + 3.17 X

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 0.881 | 1.890 | 0.47 | 0.645 |
| X | 3.17236 | 0.01892 | 167.64 | 0.000 |

S = 1.60303, R-Sq = 99.9%
![Page 38: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/38.jpg)
What should we use to generate the synthetic data?

Descriptive Statistics: X, Y

| Variable | N | Mean | StDev |
| --- | --- | --- | --- |
| X | 30 | 98.65 | 15.73 |
| Y | 30 | 313.85 | 9.12 |
![Page 39: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/39.jpg)
[Figure: Normal probability plot of X (percent against X, X axis from 60 to 140). Mean 98.65, StDev 15.73, N 30, AD 0.154, P-Value 0.952]
![Page 40: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/40.jpg)
Usual Modeling Approach (non-informative Bayes)
Take

$$X \sim N\left(98.65,\ \tfrac{31}{30}\,15.73^2\right)$$

$$Y \mid X = x \sim N\left(0.88 + 3.17x,\ \tfrac{31}{30}\,1.60^2\right)$$
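A minimal sketch of drawing synthetic records from the two normal distributions on this slide, reading the garbled formulas as a normal for X and a normal for Y given X, each with its variance inflated by 31/30. The function name and the seed are illustrative, and the slide does not say how the draws were actually produced.

```python
import numpy as np

N_SOURCE = 30                          # number of source records
INFLATE = (N_SOURCE + 1) / N_SOURCE    # the 31/30 variance factor

def draw_synthetic(n, rng):
    """Draw n synthetic (X, Y) pairs from the fitted normal model."""
    x = rng.normal(98.65, np.sqrt(INFLATE) * 15.73, size=n)
    # Y given X = x is normal around the fitted line 0.88 + 3.17 x.
    y = rng.normal(0.88 + 3.17 * x, np.sqrt(INFLATE) * 1.60)
    return x, y

rng = np.random.default_rng(2006)
sim_x, sim_y = draw_synthetic(N_SOURCE, rng)

# Refit the straight line on the synthetic data, as an analyst would.
slope, intercept = np.polyfit(sim_x, sim_y, deg=1)
print(f"Sim Y = {intercept:.2f} + {slope:.2f} Sim X")
```

With only 30 synthetic records, the refitted intercept and slope will wander around 0.88 and 3.17 from one draw to the next.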
![Page 41: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/41.jpg)
[Figure: Scatterplot of Sim Y vs Sim X; Sim X axis from 60 to 130, Sim Y axis from 200 to 400]
![Page 42: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/42.jpg)
The regression equation is
Sim Y = 3.39 + 3.14 Sim X

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 3.393 | 1.810 | 1.87 | 0.071 |
| Sim X | 3.14138 | 0.01921 | 163.56 | 0.000 |

S = 1.55825, R-Sq = 99.9%
![Page 43: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/43.jpg)
Compare with the “Gold Standard” Analysis

Based on Source Data

The regression equation is
Y = 0.88 + 3.17 X

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 0.881 | 1.890 | 0.47 | 0.645 |
| X | 3.17236 | 0.01892 | 167.64 | 0.000 |

S = 1.60303, R-Sq = 99.9%

Based on Simulated Data

The regression equation is
Sim Y = 3.39 + 3.14 Sim X

| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 3.393 | 1.810 | 1.87 | 0.071 |
| Sim X | 3.14138 | 0.01921 | 163.56 | 0.000 |

S = 1.55825, R-Sq = 99.9%
![Page 44: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/44.jpg)
Reality

$$Y = 8 + 3X + .001X^2 + \varepsilon, \qquad \varepsilon \sim N(0,1)$$

plus three outliers
![Page 45: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/45.jpg)
So What’s So Bad?
- Lost quadratic effect
  - Think of analyst with positive prior on this
- Lost outliers
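The loss of the quadratic effect can be sketched end to end. The code below assumes the "Reality" model is Y = 8 + 3X + 0.001X² plus N(0, 1) noise with three outliers; the distribution of X, the size of the outliers, the seed and all function names are assumptions made for illustration only.

```python
import numpy as np

def make_source(rng, n=30, outlier_shift=5.0):
    """Simulate source data with a small quadratic effect and three outliers."""
    x = rng.normal(100.0, 15.0, size=n)          # assumed distribution of X
    y = 8.0 + 3.0 * x + 0.001 * x**2 + rng.normal(0.0, 1.0, size=n)
    y[:3] += outlier_shift                       # assumed outlier size
    return x, y

def fit_linear_synthesizer(x, y):
    """Fit the parsimonious model: normal X, straight-line normal Y given X."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    s = np.sqrt(resid @ resid / (n - 2))
    return {"n": n, "x_mean": x.mean(), "x_sd": x.std(ddof=1),
            "intercept": intercept, "slope": slope, "s": s}

def synthesize(model, size, rng):
    """Draw synthetic records, inflating variances by (n + 1) / n."""
    k = np.sqrt((model["n"] + 1) / model["n"])
    sx = rng.normal(model["x_mean"], k * model["x_sd"], size=size)
    sy = rng.normal(model["intercept"] + model["slope"] * sx, k * model["s"])
    return sx, sy

rng = np.random.default_rng(13)
x, y = make_source(rng)
model = fit_linear_synthesizer(x, y)

# Even a very large synthetic release cannot return what the synthesizer left out.
sim_x, sim_y = synthesize(model, 100_000, rng)
quad_coef = np.polyfit(sim_x, sim_y, 2)[0]
print(f"quadratic coefficient estimated from synthetic data: {quad_coef:.6f}")
```

Because the synthesizer has no quadratic term, the coefficient estimated from the synthetic release sits near zero rather than near 0.001, however many records are released. The synthetic residuals are also normal by construction, so the three outliers are not reproduced.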
![Page 46: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/46.jpg)
Data Utility: Inference-Valid?
- What does inference valid mean?
  - Same results as with original data
  - Equal inference capability as original data? (Think like post-19th century statistician)
![Page 47: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/47.jpg)
Is Inference-Valid Synthetic Data Possible?
- “How robust are inferences to mis-specifications in the model used to draw synthetic data?” Jerry Reiter
- Method used in imputation must foresee complete-data analyses (http://www.multiple-imputation.com/)
![Page 48: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/48.jpg)
Implementation is Hard
- Model development time-consuming and human-resource demanding, typically needing domain knowledge and statistical skills
- Model is a simplification of reality, an incomplete image
- Model selection/parameterization subjective
- Data users’ models and methods more and more sophisticated (Bucher & Vckovski, 1995)
![Page 49: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/49.jpg)
Multivariate Difficulties
- Capturing multivariate statistical characteristics is time consuming (Dandekar 2004)
- Difficult to model joint distribution for several variables, especially in the presence of categorical variables (Singh, Yu, and Dunteman 2003)
![Page 50: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/50.jpg)
Sample Survey Data
- Generate synthetic data for sampled units
  - More disclosure risk
  - Data utility?
- Generate synthetic data for population units
  - Less disclosure risk
  - Data utility?
- Preserve structure of sampling design? (Singh, Yu, and Dunteman 2003)
![Page 51: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/51.jpg)
Usual Hard Problems Remain Hard!
- Geographical detail
  - Synthetic data for sampled units?
- Longitudinal data
  - Preserve complex relationships
  - Approximate à la Abowd and Woodcock (2001)
- Target known to be in sample
  - Synthetic data for sampled units?
![Page 52: PowerPoint presentation](https://reader036.fdocumento.com/reader036/viewer/2022062400/5880c3681a28abba3b8b608d/html5/thumbnails/52.jpg)
Final Messages
- Follow the R-U confidentiality map
- Don’t accept the source data as the Gold Standard
- In sculpting a synthesizer, Occam’s Razor cuts too deeply
- Implementing synthetic data is hard, so no panacea for microdata release