- Database Overview
The National Housing Preservation Database is updated three times a year, on March 31, July 31, and November 30. At each update, any changes made to source datasets by the dataset originator (such as the Department of Housing and Urban Development) are imported into the Database. A list of all data sources included in the National Housing Preservation Database and their most recent update dates can be found here.
Data quality and format vary by data source, and each data source may contain duplicate property entries. Automated procedures have therefore been created to standardize imported data and reduce the number of incorrect or duplicate entries in the National Housing Preservation Database. This process is depicted below. During the import process, data inconsistencies that the automated process cannot correct are flagged for manual cleaning. Manual cleaning takes place at each of the three yearly data updates and on an ongoing basis. The procedures used for both automated and manual cleaning are described below. Items with additional explanation are noted with a superscript.
Data in the National Housing Preservation Database undergo multiple cleaning processes to improve their accuracy.
Automated Cleaning Procedures: Automated cleaning procedures center on correcting property addresses and latitude and longitude values, as these fields are the primary matching keys for identifying and linking all of a property’s subsidies. As property addresses are imported into the Database, they are standardized according to USPS standard address protocols, and the import process attempts to remove extraneous characters or words appearing in addresses. Likewise, property names are standardized and extraneous characters are removed where possible. These procedures improve the rate of positive property matches between data sources.
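The standardization step described above can be illustrated with a minimal Python sketch. It is not the Database's actual import code: the function name `standardize_address` and the abbreviation table are assumptions made for illustration, and the table covers only a handful of the street suffix and directional abbreviations that USPS addressing standards define.

```python
import re

# Illustrative subset of USPS-style abbreviations (an assumption for this
# sketch; the real standard defines many more suffixes and unit designators).
USPS_ABBREVIATIONS = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "DRIVE": "DR",
    "ROAD": "RD",
    "NORTH": "N",
    "SOUTH": "S",
    "EAST": "E",
    "WEST": "W",
    "APARTMENT": "APT",
}


def standardize_address(raw: str) -> str:
    """Return an uppercase address with extraneous characters removed
    and common words replaced by their abbreviations."""
    text = raw.upper()
    # Replace anything other than letters, digits, whitespace, '#', '/', '-'
    # with a space so that punctuation does not block a match.
    text = re.sub(r"[^A-Z0-9\s#/-]", " ", text)
    # Splitting on whitespace also collapses repeated spaces.
    tokens = [USPS_ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


# Two spellings of the same address from different sources now compare equal.
print(standardize_address("123 North Main Street, Apt. 4"))
print(standardize_address("123 n. main st apt 4"))
```

A token-by-token lookup like this is deliberately naive: it would also abbreviate a word such as "North" when it is the street name itself, which is one reason a dedicated address verification service is used after standardization.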
Once addresses are standardized, they are entered into an address verification system. The system currently utilized is based on USPS CASS certification and Census geography files and is provided by SmartyStreets. A standard set of latitudes and longitudes is also generated from this system to ensure that the same geocoding method is used across all properties in the Database.
Manual Cleaning Procedures: Several types of data issues lead to manual review and cleaning. First, all properties whose addresses do not CASS certify as valid USPS addresses in the automated cleaning process are flagged for manual review. These properties are checked using Google Maps to validate the address and are manually cross-checked against the National Housing Preservation Database to ensure that no duplicate property exists in the Database once the address is updated. Several common address errors and their corresponding cleaning protocols are listed below.
Incomplete or Incorrect Address:
Solution: The correct property address is researched by Googling the apartment name and location to identify the official address and by using Google Maps to verify the address and identify a corresponding building footprint. Once a correct address is found, it is cross-checked against the National Housing Preservation Database to ensure that a duplicate property with the correct address is not present. If a duplicate is found, the subsidy information for the duplicate properties is merged. Each corrected address is CASS certified using SmartyStreets. If the corrected address does not CASS certify, the latitude and longitude provided by Google Maps are entered into the record. If an address is too incomplete to identify a property’s location, the property is flagged as ‘incomplete’ and remains flagged for cleaning. It cannot be updated until more information is received from the source data.
Solution: The address is viewed on Google Maps to determine whether the building footprint is viewable. If the property address is confirmed, the latitude and longitude provided by Google Maps are entered into the record. If the footprint cannot easily be confirmed, the latitude and longitude provided by the source data are retained and the property remains flagged.
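The order in which coordinates are chosen during address cleaning can be summarized as a small decision function. The following Python sketch is only one reading of the protocols above. The names `AddressReview` and `choose_coordinates` are hypothetical, and the assumption that the review flag is cleared once verified coordinates are entered is not stated in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


@dataclass
class AddressReview:
    """Hypothetical record of what a reviewer found for one property."""
    cass_coords: Optional[Coordinates]    # None if the address did not CASS certify
    footprint_confirmed: bool             # building footprint confirmed on the map
    maps_coords: Optional[Coordinates]    # coordinates read from the map, if any
    source_coords: Optional[Coordinates]  # coordinates supplied by the source data


def choose_coordinates(review: AddressReview) -> Tuple[Optional[Coordinates], bool]:
    """Return (coordinates to store, whether the property stays flagged)."""
    # 1. Prefer coordinates from the address verification system.
    if review.cass_coords is not None:
        return review.cass_coords, False  # assumption: flag is cleared
    # 2. Otherwise use map coordinates when the footprint is confirmed.
    if review.footprint_confirmed and review.maps_coords is not None:
        return review.maps_coords, False  # assumption: flag is cleared
    # 3. Otherwise keep the source coordinates and leave the flag in place.
    return review.source_coords, True


unconfirmed = AddressReview(None, False, None, (38.0, -77.5))
print(choose_coordinates(unconfirmed))
```

Returning the flag alongside the coordinates keeps unresolved properties visible for the next round of manual cleaning.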
Second, all properties that have received a comment from users or staff through the comment function in the National Housing Preservation Database are flagged for manual review. Comments may pertain to incorrect address information as described above, indicate that a property is a duplicate, state that a property’s name has changed, or describe other data issues that require change and verification. Several common duplicate and name discrepancy scenarios and their corresponding cleaning protocols are listed below.
Property is a Duplicate Entry:
Solution: The main property address is validated as described above. The subsidies located at the duplicate property are then attached to the property that has the valid address or the greater number of subsidies.
Solution: The properties are treated as separate. This is because properties with different phases also often have different property conditions and are funded independently of other phases.
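The duplicate-handling rules above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Database's implementation: the `Property` class, its `phase` field, and the function `merge_duplicates` are hypothetical, the subsidy labels in the usage lines are placeholder sample data, and the tie-breaking order (valid address first, then subsidy count) is one interpretation of the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Property:
    """Hypothetical, simplified property record."""
    name: str
    address: str
    address_valid: bool
    phase: Optional[str] = None
    subsidies: List[str] = field(default_factory=list)


def merge_duplicates(a: Property, b: Property) -> Optional[Property]:
    """Merge two suspected duplicates and return the surviving record.

    Returns None when the records are different phases, since those are
    treated as separate properties.
    """
    if a.phase != b.phase:
        return None
    # Keep the record with the valid address; if that does not decide it,
    # keep the record with more subsidies (assumed tie-breaking order).
    if a.address_valid != b.address_valid:
        keeper, duplicate = (a, b) if a.address_valid else (b, a)
    else:
        keeper, duplicate = (a, b) if len(a.subsidies) >= len(b.subsidies) else (b, a)
    # Attach the duplicate's subsidies without repeating any already present.
    for subsidy in duplicate.subsidies:
        if subsidy not in keeper.subsidies:
            keeper.subsidies.append(subsidy)
    return keeper


first = Property("Oak Manor", "12 OAK ST", True, subsidies=["Subsidy A"])
second = Property("Oak Manor Apts", "12 OAK", False, subsidies=["Subsidy B", "Subsidy C"])
print(merge_duplicates(first, second).subsidies)
```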
Third, properties may be flagged for manual review if there are major inconsistencies in property information between data sources, but the property can be linked to another property from a different data source through the use of HUD IDs. Likewise, if there are major inconsistencies in property information from one update to the next in a particular data source, the property is flagged for manual review. Staff attempt to verify the discrepancies. If they cannot be verified, one source is chosen based on staff’s assessment of data quality and the radical nature of the change.
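A minimal Python sketch of this third check is shown below. The text does not define what counts as a "major" inconsistency, so the sketch assumes a simple rule: any difference in the compared fields between two records that share an ID. The function name `flag_inconsistencies`, the field names, and the IDs in the sample data are all hypothetical.

```python
from typing import Dict, List, Sequence

Record = Dict[str, str]


def flag_inconsistencies(
    source_a: Dict[str, Record],
    source_b: Dict[str, Record],
    fields: Sequence[str] = ("name", "address"),
) -> List[str]:
    """Return the IDs present in both sources whose compared fields differ.

    Both arguments map an ID to a record. They can be two different data
    sources, or two successive updates of the same source.
    """
    flagged = []
    # Only records that can be linked by a shared ID are compared.
    for record_id in sorted(source_a.keys() & source_b.keys()):
        rec_a, rec_b = source_a[record_id], source_b[record_id]
        if any(rec_a.get(f) != rec_b.get(f) for f in fields):
            flagged.append(record_id)
    return flagged


previous = {
    "ID-001": {"name": "OAK MANOR", "address": "12 OAK ST"},
    "ID-002": {"name": "ELM COURT", "address": "5 ELM AVE"},
}
current = {
    "ID-001": {"name": "OAK MANOR", "address": "12 OAK ST"},
    "ID-002": {"name": "ELM COURT APARTMENTS", "address": "5 ELM AVE"},
    "ID-003": {"name": "PINE RIDGE", "address": "9 PINE RD"},
}
print(flag_inconsistencies(previous, current))
```

The function only surfaces candidates. Deciding which source to trust remains a staff judgment, as described above.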
Property Name is Incorrect or Has Changed:
Solution: The correct name is verified using Google and Google Maps, and the incorrect name in the record is replaced with it.
Solution: The correct name is verified using Google and Google Maps. The subsidy from the older dataset with the incorrect name is linked to the property with the updated name from the newer dataset.