Procedure for Integrating New or Updated Data

The National Housing Preservation Database is updated three times a year on March 31, July 31, and November 30. At these times, any updates made to source datasets by the dataset originator (such as the Department of Housing and Urban Development) are imported into the Database. A list of all data sources included in the National Housing Preservation Database and their most recent update date can be found here.

As data quality and data format vary by data source and each data source may contain duplicate property entries, automated procedures have been created to standardize imported data and reduce the number of incorrect or duplicate entries in the National Housing Preservation Database. This process is depicted below. During the import process, data inconsistencies that cannot be corrected through the automated process are flagged for manual cleaning. Manual cleaning takes place at each tri-annual data update and on an ongoing basis. The procedures used for both automated and manual cleaning are described below. Items with additional explanation are noted with a superscript.

Notes

  1. If the address listed for the subsidy matches a current property address in the database, each appropriate field is checked for updated information. Information about which fields are checked for each independent source is listed in the National Housing Preservation Database Data Dictionary under “origin source.”
  2. Addresses from these sources are often incomplete and must be manually updated by our administrators. For example, the cross-street entry ‘2nd St and Park Ave’ would be updated to ‘123 2nd St,’ a valid mailing address. Although the address is updated in the National Housing Preservation Database, it is generally not updated by the organization that provides the data. The original address is therefore saved to each property alongside the corrected address so that future updates for that property can still be matched.
  3. All new addresses are United States Postal Service (USPS) verified for Coding Accuracy Support System (CASS) certification using the address verification system SmartyStreets, which uses USPS databases and Census geography files to validate address information. The methodological documentation provided by SmartyStreets can be found here.
  4. CASS stands for the Coding Accuracy Support System, and is used to correct and validate the accuracy of addresses. CASS certification verifies that the address meets the US Postal Service guidelines for mail delivery.
  5. Property addresses often fail to CASS certify because there is no mail receptacle at the property or because an incomplete address was provided.
  6. Geocoding is the process of finding associated geographic coordinates, such as latitude and longitude, from other geographic data, such as street address. The National Housing Preservation Database uses SmartyStreets to determine the latitude and longitude of a given street address. The geocode for latitude and longitude provided by SmartyStreets is entered into the database as opposed to the geocode provided by the data source because it provides a standard geocoding method across all data sources imported into the National Housing Preservation Database.
  7. The geocode is verified to ensure that the matched state information, provided by the Census TIGER shapefiles, is in the same region as the property location. The location of the property is determined by the latitude and longitude from SmartyStreets if the address CASS certifies, or by the latitude and longitude provided by the data source if it does not.
  8. The new property is flagged and sent to the administrative queue for a manual address check using Google Maps because either the latitude and longitude provided by SmartyStreets do not match the location provided by the Census TIGER shapefiles, or no latitude and longitude were available to match to the shapefiles. Using the apartment name and provided address, Google Maps is used to verify a corresponding apartment building footprint. The new address overrides the old, incorrect address.
  9. These properties are required to be manually checked because the property address is not CASS certified for postal delivery, and an accurate latitude and longitude could not be pulled from SmartyStreets. When the property is updated, each appropriate field is checked for updated information. Information about which fields are checked for each independent source is listed in the National Housing Preservation Database Data Dictionary under “origin source.”
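The coordinate selection and verification described in notes 6 through 8 can be sketched roughly as follows. This is an illustrative sketch only, not the Database's actual import code. The `ImportedAddress` record, its field names, and the `state_lookup` callable (a stand-in for a point-in-polygon lookup against Census TIGER shapefiles) are all hypothetical, and comparing the looked-up state to the state listed in the source record is one assumed reading of the check in note 7.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

LatLon = Tuple[float, float]


@dataclass
class ImportedAddress:
    # All field names here are hypothetical, chosen for illustration.
    cass_certified: bool
    verifier_latlon: Optional[LatLon]  # from the address verification service
    source_latlon: Optional[LatLon]    # from the originating dataset
    listed_state: str                  # state given in the source record


def choose_latlon(rec: ImportedAddress) -> Optional[LatLon]:
    """Prefer the verifier's coordinates when the address CASS certifies,
    otherwise fall back to the coordinates supplied by the data source."""
    if rec.cass_certified and rec.verifier_latlon is not None:
        return rec.verifier_latlon
    return rec.source_latlon


def needs_manual_check(rec: ImportedAddress,
                       state_lookup: Callable[[LatLon], str]) -> bool:
    """Flag the record when no coordinates are available or when the state
    found for the coordinates disagrees with the listed state."""
    latlon = choose_latlon(rec)
    if latlon is None:
        return True
    return state_lookup(latlon) != rec.listed_state
```

In this sketch a record that returns True from `needs_manual_check` would be sent to the administrative queue described in note 8.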

National Housing Preservation Database Data Cleaning Process

Data in the National Housing Preservation Database undergo multiple cleaning processes to improve their accuracy.

Automated Cleaning Procedures: Automated cleaning procedures center on correcting property addresses and latitude and longitude values, as these fields are the primary matching keys for identifying and linking all of a property’s subsidies. As property addresses are imported into the Database, they are standardized according to USPS standard address protocols, and the import routine attempts to remove extraneous characters or words. Likewise, property names are standardized and extraneous characters are removed where possible. These procedures improve the rate of positive property matches between data sources.
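As a rough illustration of this kind of address standardization, the sketch below uppercases an address, strips punctuation, and abbreviates a few street suffixes. It is a simplified assumption of how such a step might look, not the Database's actual routine. The `SUFFIXES` table is a small hypothetical subset of the USPS suffix abbreviations, and the function name is invented for the example.

```python
import re

# Hypothetical subset of USPS street-suffix abbreviations; a real table is much larger.
SUFFIXES = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "ROAD": "RD",
    "DRIVE": "DR",
}


def standardize_address(raw: str) -> str:
    """Uppercase the address, drop extraneous punctuation, collapse
    whitespace, and abbreviate known street suffixes."""
    text = raw.upper()
    # Replace anything other than letters, digits, whitespace, '#', '/' and '-'.
    text = re.sub(r"[^A-Z0-9\s#/-]", " ", text)
    words = [SUFFIXES.get(word, word) for word in text.split()]
    return " ".join(words)


print(standardize_address("100 Main Street."))  # 100 MAIN ST
```

Normalizing both sources to the same form before comparison is what allows two differently formatted records of the same address to match.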

Once addresses are standardized, they are entered into an address verification system. The system currently utilized is based on USPS CASS certification and Census geography files and is provided by SmartyStreets. A standard set of latitudes and longitudes is also generated from this system to ensure that the same geocoding method is used across all properties in the Database.

Manual Cleaning Procedures: Several types of data issues lead to manual review and cleaning. First, all properties that do not CASS certify with a valid USPS address in the automated cleaning process are flagged for manual review. These properties are checked using Google Maps to validate the address and are manually cross-checked to the National Housing Preservation Database to ensure that there are no duplicate properties located in the database once the address is updated. Several common address errors and their corresponding cleaning protocol are listed below.

Incomplete or Incorrect Address:

  • Case 1: Address is incomplete and does not contain a house number. (Ex. Main St.)
  • Case 2: Address is a set of cross streets. (Ex. 5th and Vine)
  • Case 3: Address contains no street address. (Ex. apartment name or city is repeated in street address line)
  • Case 4: Address contains misspellings. (Ex. 100 Mairn St., Phonix, AZ)
  • Case 5: Address contains incorrect information. (Ex. 100 Main St., Phoenix, AR)
  • Solution: The correct property address is researched by searching Google for the apartment name and location to identify the official address and by using Google Maps to verify the address and identify a corresponding building footprint. Once a correct address is found, it is cross-checked against the National Housing Preservation Database to ensure that a duplicate property with the correct address is not present. If a duplicate is found, the subsidy information for the duplicate properties is merged. Each corrected address is CASS certified using SmartyStreets. If the corrected address does not CASS certify, the latitude and longitude provided by Google Maps are entered into the record. If an address is too incomplete to identify a property’s location, the property is flagged as ‘incomplete’ and remains flagged for cleaning. It cannot be updated until more information is received from the source data.

  • Case 6: Address does not offer a mail receptacle.
  • Solution: The address is viewed on Google Maps to confirm that the building footprint is visible. If the property address is confirmed, the latitude and longitude provided by Google Maps are entered into the record. If the footprint cannot easily be confirmed, the latitude and longitude provided by the source data are retained and the property remains flagged.
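Some of the address problems listed above can be recognized from the text of the street line alone. The sketch below is a hypothetical pre-screen for Cases 1 through 3 only. It is not part of the documented procedure, and the function name, return labels, and regular expressions are assumptions made for illustration. Cases 4 through 6 require outside reference data or manual research and are not covered.

```python
import re
from typing import Optional

HOUSE_NUMBER = re.compile(r"^\d+[A-Za-z]?\s")            # e.g. "100 " or "100A "
CROSS_STREETS = re.compile(r"\s(?:and|&)\s", re.IGNORECASE)


def classify_address_issue(street_line: str,
                           property_name: str = "",
                           city: str = "") -> Optional[str]:
    """Return a label for Cases 1-3, or None when no issue is detected."""
    line = street_line.strip()
    # Values that would indicate the street line merely repeats other fields.
    repeated = {property_name.strip().casefold(), city.strip().casefold()} - {""}
    if not line or line.casefold() in repeated:
        return "no_street_address"       # Case 3
    if HOUSE_NUMBER.match(line):
        return None                      # starts with a house number
    if CROSS_STREETS.search(line):
        return "cross_streets"           # Case 2
    return "missing_house_number"        # Case 1
```

A pre-screen like this would only sort records into review categories; the manual research described in the solutions above is still needed to find the correct address.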

Second, all properties that have received a comment from users or staff through the comment function in the National Housing Preservation Database are flagged for manual review. Comments may pertain to incorrect address information as described above, indicate that a property is a duplicate, state that a property’s name has changed, or raise other data issues that require change and verification. Several common duplicate and name discrepancy scenarios and their corresponding cleaning protocols are listed below.

Property is a Duplicate Entry:

  • Case 1: Properties with the same street name and city/state, the same or a similar name, and unit counts within +/- 2.
  • Case 2: Properties with the same property name and city/state, unit counts within +/- 2, and different street addresses.
  • Solution: The main property address is validated as described above. Then the subsidies located at the duplicate property are attached to the property that has the valid address or the larger number of subsidies.

  • Case 3: Properties with the same street name and city/state and unit counts within +/- 2, but different phases. (Ex. Village Estates I and Village Estates II)
  • Solution: The properties are treated as separate. This is because properties with different phases also often have different property conditions and are funded independently of other phases.
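The three duplicate cases above can be expressed as a simple comparison rule. The sketch below is an assumption-laden illustration rather than the Database's actual matching logic. The `Property` record, the `PHASE` pattern, and the function names are hypothetical, and "same or similar name" is simplified to an exact match after removing the phase marker and ignoring letter case.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Property:
    # Hypothetical record layout used only for this example.
    name: str
    street_name: str
    city: str
    state: str
    units: int


# Trailing phase marker such as " I", " II", " 2", or " PHASE 2".
PHASE = re.compile(r"\s+(?:PHASE\s+)?([IVX]+|\d+)$")


def split_phase(name: str) -> Tuple[str, str]:
    """Return (base name, phase marker); the marker is '' when absent."""
    text = name.strip().upper()
    match = PHASE.search(text)
    if match:
        return text[:match.start()], match.group(1)
    return text, ""


def duplicate_case(a: Property, b: Property) -> Optional[str]:
    """Return 'case1' or 'case2' for likely duplicates, or None otherwise."""
    same_place = (a.city.upper(), a.state.upper()) == (b.city.upper(), b.state.upper())
    similar_units = abs(a.units - b.units) <= 2
    base_a, phase_a = split_phase(a.name)
    base_b, phase_b = split_phase(b.name)
    if not (same_place and similar_units and base_a == base_b):
        return None
    if phase_a != phase_b:
        return None  # Case 3: different phases are treated as separate properties
    if a.street_name.upper() == b.street_name.upper():
        return "case1"
    return "case2"
```

For example, under these assumptions "Village Estates I" and "Village Estates II" on the same street are kept separate, while two "Village Estates I" records with unit counts of 50 and 52 are returned as a Case 1 candidate for merging.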

Third, properties may be flagged for manual review if there are major inconsistencies in property information between data sources, but the property can be linked to another property from a different data source through the use of HUD IDs. Likewise, if there are major inconsistencies in property information from one update to the next in a particular data source, the property is flagged for manual review. Staff attempt to verify the discrepancies; if they cannot be verified, one source is chosen based on staff’s assessment of data quality and the radical nature of the change.

Property Name is Incorrect or Has Changed:

  • Case 1: Property name on Google Maps is different.
  • Case 2: Property name on the official website is different.
  • Solution: The correct name is verified using Google and Google Maps, and it replaces the incorrect name.

  • Case 3: Property name is correct, but can be linked to a duplicate property with different address and different property name (most likely from a new data source).
  • Solution: The correct name is verified using Google and Google Maps. The subsidy from the older dataset with the incorrect name is linked to the property with the updated name from the newer dataset.