About the Author and SQLMatch(tm)
SQLMatch began as a series of SQL program scripts designed to automatically build a
linkage program for large Vital Statistics databases in MS SQL Server. There was not
enough time in the regular work day to write a new set of programs each time a new set of
tables were to be matched. And there was not enough money to purchase some of the very
expensive linkage programs available.
So, SQLMatch was developed as a way to quickly build a customized flexible probabilistic
linkage solution without writing thousands of lines of SQL code while still having that code
available for open modification.
And once the tables are linked and the possible matches are identified, it is easy to review
them, weed out the false positive matches, and allow SAS and other external programs to
have live access to the results.
Patrick Smith has 20 years of database programming experience, working as a consultant
as well as with the California Birth Defects Monitoring Program (CBDMP) in Berkeley,
California. He has been linking large datasets since 1999 and will be creating the
customized SQLMatch databases for you..
2007: 22 years (1983-2004) of newborn bloodspots were linked to VS birth records. 10.8
million were matched, which was 95% of possible matches, with an accuracy of 99.98%
when double-checked with other hand matched linkages done over the years. Recent years
had higher percentages. With more time, a higher percentage could be matched in the
2005-2006: 300,000 2003 and 2004 MSAFP TRF forms (form mother fills out at 4 months
gestation to go with blood sample test) were linked to 2003, 2004 and 2005 VS live birth
forms (1.5 million records) using SQL Match. Since there is a lag time of 4-5 months
between the time of the test and the birth of the baby, only test years 2003 and 2004 are
complete. Over 90% were found to be ‘highly probable’ (99%+ accuracy when hand
matched) matches. It took several hours to run this on an average 2 CPU server in 2005.
And what of the other 10%?
After reviewing information on the state Vital Statistics website, it was determined that in the
5 months between the test and the birth, at least 3-4% of mothers move out of the state. So
they will never be found when linking to California birth records. Also, some babies are not
born alive and will never be found in the fetal death records since only about VS fetal deaths
3,000 are recorded each year. In addition, there appeared to be many mistakes in the
sparsely populated MSAFP SSN field, even among those that do match. Mistakes, along with
numerous empty fields made for difficult linkage on these datasets above the 90% level.
2003-2005: 2003 and 2004 California MSAFP TRF forms were linked to 2003, 2004 and
2005 Vital Statistics AVSS (preliminary VS birth) forms. It was thought that we could do
linkage to births faster by using quarterly AVSS files rather than VS Births. But the AVSS files
were incomplete. The percentage results from the linkage were similar to the MSAFP to VS
live births, but AVSS has missing records and other problems.
2002-2007: Numerous datasets were linked, including older years of the preceding data, VS
Birth Cohorts, GDB datasets, and other collaborator datasets.
1999-2007 – Each year, our CBDMP birth defect babies were linked with VS Births and Fetal
deaths. This is a deterministic linkage that includes over 25 CBDMP variables and a full
hand match of marginal links, which account for about 20% of the total. We match about
97% of possible birth defect live birth cases with a certainty of well over 99%. This linkage
includes extensive QC rules regarding BPA6 coding, and other data validation. The linked
datasets become our birth defects registry. SQL Server views were created to the birth
defects registry and other datasets so that our Research Analysts and Epidemiologists could
create SAS datasets directly from the SQL Server Data. This ensured that the original data in
the numerous SAS datasets remained constant.
This linkage is now being accomplished with SQLMatch.