The SafeGraph Developer Hub

Welcome to the SafeGraph developer hub. You'll find comprehensive guides and documentation to help you start working with SafeGraph as quickly as possible, as well as support if you get stuck. Let's jump right in!


Places Manual

This document provides details on attribute methodology and answers frequently asked questions (FAQs) about the nuances of the SafeGraph Places dataset.

Also check out our Data Science Resources

Core Places

Places Scope

Core Places provides baseline information about every record in the SafeGraph product suite. The current scope of a place is defined as a location where consumers can spend money and/or time in the U.S. and Canada. This definition encompasses a broad swath of places, ranging from restaurants, grocery stores, and malls to parks, hospitals, and museums. In contrast, places like the SafeGraph office, homes, and residential-only apartment buildings are not places that consumers visit, so these are not included in the SafeGraph dataset. The one exception to this general rule is industrial POIs. We are now sourcing POIs such as distribution centers, warehouses, and B2B equipment wholesalers as a result of popular demand. See the Industrial POIs section for details.

To enable easy search for certain types of places, we group places into 6-digit NAICS codes based on available metadata (see the naics_code section below for more information). We rely on a machine learning model to accurately predict the naics_code of each place, and the model is continually recalibrated. In general, the better the metadata, the more accurate the prediction. Some NAICS codes are easier to predict than others due to the definition of the NAICS code itself, the volume of data, and the quality of data. For example, we are very confident in our ability to assign the proper naics_code to places that are Full-Service Restaurants (722511). However, we are relatively less confident in our ability to assign the proper naics_code to places that are "Sewing, Needlework, and Piece Goods Stores" (451130). If there are certain types of places missing from our current scope, or if there are other countries that would be valuable to include in our scope, please reach out to your SafeGraph rep or Contact Sales.

placekey

Placekey is a unique and persistent identifier for any physical place in the U.S. that intelligently partitions the ID into meaningful encodings. So how does Placekey work?

When both parts of a Placekey come together, the final result reads as What@Where. This is a unique way of shedding light on both the descriptive element of a place and its geospatial position in the physical world via a single identifier.

What: Address Encoding
The first three characters refer to the Address Encoding, creating a unique identifier for a given address. An address at “555 Main Street Suite 105” will have a different Address Encoding than “555 Main Street Suite 106.” However, "444 Second Street, Suite 4" will have the same address encoding as "444 2nd St. #4" to adjust for common address formats.

What: POI Encoding
The second set of three characters in the What Part refers to the POI Encoding. If a specific place has a location name (like "Central Park") and is already included in the Placekey reference datasets, these characters will be present. The benefit of the POI Encoding is that it can point to a specific Point-of-Interest that may have existed at a certain address at a given point in time.

Where: H3 Encoding
The Where Part, on the other hand, is made up of three unique character sequences, built upon Uber’s open source H3 grid system. This information in the Where Part is based on the centroid of that place. In other words, we take the latitude and longitude of a specific place and then use a conversion function to determine a hexagon in the physical world, representing about 15,000 sq. meters, containing the centroid of that place. The Where Part of the Placekey is, therefore, the full encoding of that hexagon.
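
The What and Where parts described above can be pulled apart with a few lines of string handling. The following is a minimal sketch, not part of any official Placekey tooling: the function name `split_placekey` and the sample key are hypothetical, and the sketch assumes the public Placekey layout in which the groups inside each part are separated by hyphens.

```python
def split_placekey(placekey):
    """Split a Placekey string into its What and Where components."""
    what, sep, where = placekey.partition("@")
    if not sep:
        raise ValueError("expected '@' between the What and Where parts")
    # Assumption: groups inside each part are hyphen-separated.
    what_parts = what.split("-") if what else []
    return {
        "address_encoding": what_parts[0] if len(what_parts) > 0 else None,
        "poi_encoding": what_parts[1] if len(what_parts) > 1 else None,
        "where": where,
        "where_parts": where.split("-"),
    }


# Made-up key, used only to illustrate the layout.
example = split_placekey("222-223@5vg-7gq-tvz")
where_only = split_placekey("@5vg-7gq-tvz")
```

Keeping the Where part as a separate field makes it easy to group records that fall in the same hexagon, even when their What parts differ.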

Open access to your own datasets using the FREE Placekey API.

safegraph_place_id

The SafeGraph Place ID safegraph_place_id is a unique and persistent identifier for a place of interest and the original primary key across SafeGraph products. In the long term, this column will be deprecated in favor of placekey.

safegraph_brand_id

  • SafeGraph curates over 5600 distinct brands (and growing). These are chains of commercial POIs that include all major brands in the United States (McDonald's, AMC, Macy's, Chevrolet, Whole Foods Market, etc.).
  • ~1 million POIs are associated with at least one brand. Please note that ~80% of POIs have no brand associated as they are single commercial locations (local restaurants, museums, etc.). Please see Places Summary Statistics for more detail. SafeGraph is continually improving the fill rate of brands with each release -- please contact us if you notice a brand missing.
  • Some POIs include multiple brands. Car dealerships are a good example of this: a given dealership may sell multiple car brands. Another example is POIs that are co-located, such as some Taco Bell & KFC stores, or IMAX and AMC (or Regal, etc.) cinemas. In these cases the brands and brand_ids are listed as an array that is alphabetized by brand name and the order does not specify any importance.
  • Brands provide an easy way to isolate for only major stores. If you know you are searching for a brand that we cover, we advise searching the brand column instead of the name column. Even better is to search the brand_info file and build your workflows around safegraph_brand_id.
  • Every place has a name but only POIs belonging to a chain will have a brand. In certain cases, name and brand will be the same but in other cases, these fields may be different.
  • For example, if you’re searching for all McDonald’s (fast-food) stores, you would search for all POI entries where brands = ‘mcdonalds’. Many stores may be called mcdonalds that are not the fast-food chain and therefore searching for name = 'mcdonalds' would incorrectly return non-fast-food stores. Please note that you may see alternative names in the name column even once you filter for brands = 'mcdonalds', but these will all be McDonald's fast food stores.
  • Please note that if you are having difficulty matching location or brand names to your own POI listings, we offer a matching service that will provide you with the SGPIDs of locations mapped to your existing POI data.
Name | Brand | Comment
mcdonald's us | mcdonalds |
mcdonald's store | mcdonalds | You may occasionally get variations in the name column, but as long as you query the brand column correctly, you'll get all the McDonald's locations.
mcdonald's tractor supply store | null | This POI is not a branch of the McDonald's fast food chain, which is easy to notice because it has a null value in the brands column.
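
The difference between searching by name and searching by brand can be sketched in a few lines. This is only an illustration: the rows mirror the table above, the field layout (a list of brand strings per POI, empty when there is no brand) and the helper names are assumptions rather than the delivered file format.

```python
# Rows modeled on the table above; the structure is an assumption for illustration.
places = [
    {"name": "mcdonald's us", "brands": ["mcdonalds"]},
    {"name": "mcdonald's store", "brands": ["mcdonalds"]},
    {"name": "mcdonald's tractor supply store", "brands": []},
]


def with_brand(rows, brand):
    """Keep only POIs whose brands list contains the given brand."""
    return [row for row in rows if brand in row["brands"]]


def name_contains(rows, text):
    """Keep POIs whose name contains the text (prone to false positives)."""
    return [row for row in rows if text in row["name"]]


by_brand = with_brand(places, "mcdonalds")
by_name = name_contains(places, "mcdonald")
```

Here the name search also returns the tractor supply store, while the brand search returns only the two fast-food locations.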

street_address

  • We implement a number of steps to clean, validate and standardize street_address.
  • You should expect street_address to be Title Cased, consistent and friendly for human reading. Please send us your feedback if you see otherwise.
  • If you care about street addresses as much as we do, we also have more specific address columns to split out address components. These are optional and available upon request for future deliveries!
    • primary_number
    • street_predirection
    • street_name
    • street_postdirection
    • street_suffix

naics_code, top_category, sub_category

  • SafeGraph Places uses the NAICS categorization taxonomy developed by the US Census Bureau that consists of a numeric NAICS code up to 6 digits in length.
  • The code itself is hierarchical; in other words, the first 2 digits describe a very general category, and additional digits describe more and more specific categories. For example:
    • 72 is the general category Accommodation and Food Services.
    • 722 is the more specific category Food Services and Drinking Places.
    • 7225 is the even more specific category Restaurants and Other Eating Places.
    • 722513 is the most specific category Limited-Service Restaurants (i.e. quick-serve or fast-food restaurants).
  • top_category and sub_category are the string labels associated with the first 4 digits and 6 digits of naics_code, respectively.
  • Category information is available for almost all of our POIs (see latest fill rate stats here).
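
Because the code is hierarchical, broader categories can be recovered by taking prefixes of naics_code. The sketch below assumes codes are handled as strings; the helper names are hypothetical.

```python
def naics_prefixes(naics_code):
    """Return the 2-, 3-, 4- and 6-digit levels available for a code."""
    code = str(naics_code)
    return [code[:n] for n in (2, 3, 4, 6) if len(code) >= n]


def category_codes(naics_code):
    """Codes that the top_category and sub_category labels correspond to."""
    code = str(naics_code)
    return {
        "top_category_code": code[:4],
        # sub_category is only defined when the full 6-digit code is present.
        "sub_category_code": code if len(code) == 6 else None,
    }


levels = naics_prefixes("722513")
codes = category_codes("722513")
```

Filtering on a prefix (for example, every code starting with 722) is then a simple string comparison.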

As of the August-2020 Places release, we are consolidating a handful of 6-digit naics_codes into 4-digit naics_codes in cases where the 6-digit naics_code is too obscure to distinguish from adjacent 6-digit naics_codes in the same family. In these cases, the “sub_category” column will be null.

  • An example of the 6-digit naics_codes we're consolidating is “Advertising Agencies” (541810) and “Other Services Related to Advertising” (541890). Instead of maintaining these as separate 6-digit naics_codes that are not meaningfully differentiated, we will assign all POIs fitting either description to the 4-digit naics_code “Advertising, Public Relations, and Related Services” (5418).

A grand total of 20 six-digit naics_codes will consolidate to 9 four-digit naics_codes. It’s important to note that just because a 4-digit naics_code exists in our model does not guarantee it will show up in our data. For example, it is possible that zero POIs are assigned to naics_code = 3169 (Other Leather and Allied Product Manufacturing). However, 3169 is a possible naics_code value if a POI fitting that description exists.

A complete mapping of the 6-digit to 4-digit changes can be found below:

New naics_code (top_category) | Old naics_code (sub_category)
3169 (Other Leather and Allied Product Manufacturing) | 316992 (Women's Handbag and Purse Manufacturing), 316998 (All Other Leather Good and Allied Product Manufacturing)
3231 (Printing and Related Support Activities) | 323111 (Commercial Printing (except Screen and Books)), 323113 (Commercial Screen Printing), 323117 (Books Printing)
3369 (Other Transportation Equipment Manufacturing) | 336991 (Motorcycle, Bicycle, and Parts Manufacturing), 336999 (All Other Transportation Equipment Manufacturing)
3399 (Other Miscellaneous Manufacturing) | 339910 (Jewelry and Silverware Manufacturing)
5324 (Commercial and Industrial Machinery and Equipment Rental and Leasing) | 532412 (Construction, Mining, and Forestry Machinery and Equipment Rental and Leasing), 532490 (Other Commercial and Industrial Machinery and Equipment Rental and Leasing)
5416 (Management, Scientific, and Technical Consulting Services) | 541611 (Administrative Management and General Management Consulting Services), 541613 (Marketing Consulting Services), 541618 (Other Management Consulting Services), 541690 (Other Scientific and Technical Consulting Services)
5418 (Advertising, Public Relations, and Related Services) | 541810 (Advertising Agencies), 541890 (Other Services Related to Advertising)
6233 (Continuing Care Retirement Communities and Assisted Living Facilities for the Elderly) | 623311 (Continuing Care Retirement Communities), 623312 (Assisted Living Facilities for the Elderly)
7111 (Performing Arts Companies) | 711110 (Theater Companies and Dinner Theaters), 711130 (Musical Groups and Artists)

Additionally, a total of four 6-digit naics_codes will migrate to a different 6-digit naics_code. See below for those changes:

New naics_code (sub_category) | Old naics_code (sub_category)
447110 (Gasoline Stations with Convenience Stores) | 447190 (Other Gasoline Stations)
488190 (Other Support Activities for Air Transportation) | 488119 (Other Airport Operations)
493110 (General Warehousing and Storage) | 493190 (Other Warehousing and Storage)
611519 (Other Technical and Trade Schools) | 611512 (Flight Training)
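
When comparing releases from before and after these changes, it can help to translate old codes into their new equivalents. The following sketch encodes the two mappings above exactly as listed; the names `CONSOLIDATED`, `MIGRATED`, and `current_naics` are hypothetical, and the code assumes naics_code values are handled as strings.

```python
# The 20 six-digit codes that collapse to their 4-digit prefix (first table above).
CONSOLIDATED = {
    "316992", "316998", "323111", "323113", "323117", "336991", "336999",
    "339910", "532412", "532490", "541611", "541613", "541618", "541690",
    "541810", "541890", "623311", "623312", "711110", "711130",
}

# Old 6-digit code -> new 6-digit code (second table above).
MIGRATED = {
    "447190": "447110",
    "488119": "488190",
    "493190": "493110",
    "611512": "611519",
}


def current_naics(old_code):
    """Translate a pre-change naics_code to the post-change value."""
    code = str(old_code)
    if code in MIGRATED:
        return MIGRATED[code]
    if code in CONSOLIDATED:
        return code[:4]
    return code
```

Codes that appear in neither table pass through unchanged.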

category_tags

  • Here is the full list of possible tags. category_tags currently only supports POI belonging to NAICS "Food Services and Drinking Places" (i.e. the first three digits of the naics_code are 722). For these POI, SafeGraph provides higher-granularity category information via category_tags. category_tags is a list of descriptive words about the POI. There is a fixed set of possible tags (see link above). There is no constraint on how many descriptors (tags) a POI can have; SafeGraph strives to label all relevant tags for every POI.

latitude & longitude

  • In general, latitude and longitude are defined by our best knowledge of the POI location. They are not designed to specifically locate the front door of the business, but rather define the general center of the business.
  • Latitude and longitude still attempt to identify the individual business even if that business and others have the same polygon (e.g. strip mall).

open_hours

The new format for open hours is a JSON string with days as keys and opening & closing times (in the POI's local time) as values.

  • Each JSON string is guaranteed to have all 7 days as keys
  • We indicate that a POI is closed for the day by giving it a value of "[]"
  • We indicate that a POI is open the entire day by using a format like:

"Thu": [["0:00", "24:00"]]

  • For POI that open and close multiple times throughout the day (e.g. a restaurant open in the morning and evening but not midday), we list multiple opening/closing pairs. For example:

"Sat": [["8:00", "13:00"], ["15:00", "22:30"]]

  • This indicates that a POI is open from 8 am to 1 p.m. and also from 3 p.m. to 10:30 p.m. on Saturday.

  • For POI that open and close on different days (e.g. a bar which opens on Tuesday at 6 p.m. and closes on Wednesday at 2 a.m.), we use a format like:

"Tue": [["18:00", "24:00"]], "Wed": [["0:00", "2:00"]]

To reiterate: a “closing time” of 24:00 doesn’t mean the POI actually closes at midnight if it’s followed by an opening time of 0:00 on the following day.

Example Open Hours JSON string

{ "Mon": [["8:00", "22:00"]], "Tue": [["8:00", "13:00"], ["18:00", "24:00"]], "Wed": [["0:00", "2:00"]], "Thu": [["0:00", "24:00"]], "Fri": [["23:00", "24:00"]], "Sat": [["0:00", "3:00"], ["15:00", "22:30"]], "Sun": [] }

This example represents the following open / close times:

  • Open from 8 a.m. to 10 p.m. on Monday
  • Open from 8 a.m. to 1 p.m. and 6 p.m. onwards on Tuesday
  • Open until 2 a.m. on Wednesday (note: open from Tuesday 6 p.m. through 2 a.m. Wednesday)
  • Open all day on Thursday (i.e. midnight Wednesday to midnight Thursday)
  • Open from 11 p.m. onwards on Friday
  • Open until 3 a.m. and between 3 p.m. and 10:30 p.m. on Saturday
  • Closed on Sunday
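
As a rough illustration of working with this format, the sketch below checks whether a POI is open at a given local time. The helper names are hypothetical, and treating each pair as a half-open interval (open at the start time, closed at the end time) is an assumption made for this example.

```python
import json

# The example open hours string from above.
EXAMPLE = (
    '{ "Mon": [["8:00", "22:00"]], '
    '"Tue": [["8:00", "13:00"], ["18:00", "24:00"]], '
    '"Wed": [["0:00", "2:00"]], "Thu": [["0:00", "24:00"]], '
    '"Fri": [["23:00", "24:00"]], '
    '"Sat": [["0:00", "3:00"], ["15:00", "22:30"]], "Sun": [] }'
)


def to_minutes(hhmm):
    """Convert 'H:MM' (including '24:00') to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def is_open(open_hours_json, day, hhmm):
    """True if the local time falls inside any open/close pair for that day."""
    hours = json.loads(open_hours_json)
    now = to_minutes(hhmm)
    return any(to_minutes(start) <= now < to_minutes(end)
               for start, end in hours[day])
```

Because overnight hours are split across two days, a lookup on the correct local day is enough; no special handling of the 24:00 / 0:00 boundary is needed for this check.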

phone_number

This is a 10-digit phone number. We filter out toll-free numbers (e.g. 1-800) and strive to have POI-specific numbers (not franchise-level or corporate-level numbers).

opened_on, closed_on, tracking_opened_since, tracking_closed_since

opened_on and closed_on dates are determined from metadata at the source level. If a new POI from an existing source repeatedly appears in our build pipeline, it is flagged as opened_on during the month in which it first appears. Similarly, if a POI from an existing source repeatedly disappears in our build pipeline, it is flagged as closed_on during the month in which it first disappears. These flags are added to the Places product permitting final QA checks and overall data hygiene.

Temporary closures are not captured in open/close tracking, and it became difficult to distinguish permanent closures from temporary closures at the onset of COVID-19. This resulted in a relatively low count of POIs with closed_on values between "2020-03" and "2020-06" as we erred on the side of caution to avoid mistakenly marking temporarily closed businesses as permanently closed. We are also aware that our closed_on values are over-indexed on "2020-01" as this was the first month we began tracking internally.

Some POIs have not yet been sourced consistently enough to provide the metadata needed to determine opened_on or closed_on dates. These POIs will have a null value in the tracking_opened_since and/or tracking_closed_since columns. If a date is provided in the tracking_opened_since and/or tracking_closed_since column, it indicates that we have enough metadata since that date to infer an opening or closing if the POI happened to open or close since that date. To be clear, a value in the tracking_closed_since column does not mean the POI closed on that date, and a value in the tracking_opened_since column does not mean the POI opened on that date. In general, the SafeGraph Places product tracks opened_on and closed_on dates from as early as 2019-07 onward, and therefore, the majority of POIs that have a tracking_closed_since date will show a value of "2019-07."
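
The interplay between closed_on and tracking_closed_since can be summarized as a small decision rule. This is a sketch of the interpretation described above, not SafeGraph code; the function name is hypothetical, and nulls are assumed to arrive as `None`.

```python
def closure_status(closed_on, tracking_closed_since):
    """Interpret the two closure columns for a single POI."""
    if closed_on is not None:
        # A closing was inferred; closed_on holds the month it was flagged.
        return "closed in " + closed_on
    if tracking_closed_since is None:
        # Not enough metadata to infer a closing either way.
        return "unknown: closures not tracked for this POI"
    # Tracked since that date and no closing inferred so far.
    return "no closure inferred since " + tracking_closed_since
```

The same pattern applies to opened_on and tracking_opened_since.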

Please note that opened_on, closed_on, tracking_opened_since, and tracking_closed_since columns are specific to Core Places. These are not available in stand-alone Geometry or Patterns purchases. If Core is purchased in combination with Geometry and/or Patterns, the Geometry and Patterns specific fields will be null for any POIs with a closed_on date. Please reference the Column Ordering section of the Places Manual for details on where these columns exist per product combination.

Industrial POIs

The Industrial POIs in SafeGraph Places include distribution centers, warehouses, manufacturing plants, and B2B equipment wholesalers and are found in the following naics_codes | safegraph_subcategory:

  • 336111 | Automobile Manufacturing
  • 336120 | Heavy Duty Truck Manufacturing
  • 336411 | Aircraft Manufacturing
  • 423810 | Construction and Mining (except Oil Well) Machinery and Equipment Merchant Wholesalers
  • 423820 | Farm and Garden Machinery and Equipment Merchant Wholesalers
  • 423830 | Industrial Machinery and Equipment Merchant Wholesalers
  • 423840 | Industrial Supplies Merchant Wholesalers
  • 423910 | Sporting and Recreational Goods and Supplies Merchant Wholesalers
  • 493110 | General Warehousing and Storage

The following brands include industrial POIs but feature a more common naics_code:

  • Fastenal (SG_BRAND_50660a822ab2e47648b72750d4256f7b) | 444130 | Hardware Stores
  • International Trucks (SG_BRAND_4bc53bddaed9ceb9) | 441110 | New Car Dealers
  • Freightliner Trucks (SG_BRAND_05dfc342d2c7e2dcece0bc400f2e50a4) | 441110 | New Car Dealers
  • Peterbilt (SG_BRAND_1d86c306ba9b0950) | 441110 | New Car Dealers
  • Volvo Trucks (SG_BRAND_e79ba8c9826b5c84) | 441110 | New Car Dealers
  • Mack Trucks (SG_BRAND_309d9b9d43abfe5a9ef665f263348b02) | 441110 | New Car Dealers

Geometry

wkt

  • Spatial Reference used: EPSG:4326
  • WKT stands for Well-Known-Text. It’s a simple way to define a polygon/shape and is the standard format for polygons in SafeGraph Places.
  • Other geospatial file formats you may utilize include Shapefile and GeoJSON. WKT can easily be converted to these formats and file conversions are available by request.
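
For simple cases, a WKT polygon can be converted to a GeoJSON-style geometry without extra dependencies. The sketch below is deliberately limited: it handles only two-dimensional `POLYGON` text (one or more rings), copies coordinates in the order they appear, and uses a hypothetical function name and made-up coordinates. A geospatial library is a better choice for anything more complex.

```python
import re


def polygon_wkt_to_geojson(wkt):
    """Convert a 2D 'POLYGON ((x y, ...), ...)' string to a GeoJSON-style dict."""
    match = re.fullmatch(r"POLYGON\s*\(\s*(.*)\s*\)", wkt.strip(),
                         flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("this sketch only handles POLYGON WKT")
    rings = []
    # Each ring is a parenthesized, comma-separated list of 'x y' pairs.
    for ring_text in re.findall(r"\(([^()]*)\)", match.group(1)):
        ring = []
        for pair in ring_text.split(","):
            x, y = pair.split()
            ring.append([float(x), float(y)])
        rings.append(ring)
    return {"type": "Polygon", "coordinates": rings}


# Made-up triangle used only to exercise the parser.
geometry = polygon_wkt_to_geojson(
    "POLYGON ((-122.4 37.8, -122.4 37.9, -122.3 37.9, -122.4 37.8))"
)
```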

Spatial Hierarchy

  • Some POIs are characterized by a broader footprint and cannot be represented by the outline of a single building. These types of POIs often encompass smaller POIs within their borders, and we try to flag where these overlapping relationships exist in the real world by setting the parent_safegraph_place_id of the smaller POI equal to the safegraph_place_id of the larger, encompassing POI. We colloquially refer to the larger, containing POI as the "parent" and the smaller POI as the "child."
  • If a POI is not contained by an overlapping polygon, the parent_safegraph_place_id will be null. Only POIs of particular categories can qualify as "parent" POIs with the exception of brands wholly containing other brands (ex: a Subway within Wal-Mart). Below is the full list of sub_categories and corresponding naics_codes that have the potential to be parents:
    • Gasoline Stations with Convenience Stores (447110)
    • Malls (531120)
    • Elementary and Secondary Schools (611110)
    • Nature Parks and Other Similar Institutions (712190)
    • Colleges, Universities, and Professional Schools (611310)
    • Correctional Institutions (922140)
    • Junior Colleges (611210)
    • Golf Courses and Country Clubs (713910)
    • Casinos (except Casino Hotels) (713210)
    • Casino Hotels (721120)
    • Hotels (except Casino Hotels) and Motels (721110)
    • Other Technical and Trade Schools (611519)
    • Amusement and Theme Parks (713110)
    • General Medical and Surgical Hospitals (622110)
    • Family Planning Centers (621410)
    • All Other Outpatient Care Centers (621498)
    • Freestanding Ambulatory Surgical and Emergency Centers (621493)
    • Kidney Dialysis Centers (621492)
    • Sports Teams and Clubs (711211)
    • Promoters of Performing Arts, Sports, and Similar Events with Facilities (711310)
    • Skiing Facilities (713920)
    • Other Airport Operations (488119)
  • In rare cases, a parent POI can have a parent. Examples include
    • starbucks > airport terminal > airport
    • subway > walmart > shopping center
    • physician's office > outpatient care center > regional medical campus
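
Multi-level relationships like the ones above can be resolved by following parent_safegraph_place_id until it is null. The sketch below assumes the parent links have been loaded into a dictionary; the IDs and the function name are hypothetical.

```python
def ancestor_chain(place_id, parent_of):
    """Return the parents of a POI, nearest first, by following parent links."""
    chain = []
    seen = {place_id}  # guards against accidental cycles in the input
    current = parent_of.get(place_id)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


# Made-up IDs mirroring the starbucks > airport terminal > airport example.
parent_of = {
    "sg:starbucks": "sg:terminal",
    "sg:terminal": "sg:airport",
    "sg:airport": None,
}
chain = ancestor_chain("sg:starbucks", parent_of)
```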

polygon_class

  • We provide a polygon_class for each POI to identify how well the polygon reflects the POI itself.
  • In dense environments, such as indoor malls or multi-story buildings, we may not be confident about a POI’s true shape, so we provide the overall structure polygon instead. This results in several POIs sharing the same polygon. In other cases, we simply may not have a unique polygon for each POI, so several POIs end up sharing the same polygon. In each of these cases, the POIs would be classified as "SHARED_POLYGON" in the polygon_class column.
  • If a single POI maps to a distinct polygon (excluding that POI's children), then the POI is classified as "OWNED_POLYGON" in the polygon_class column. We exclude children from influencing a POI's polygon_class because in cases where a unique polygon is not available for a child POI, the child POI most likely maps to the parent POI's polygon; however, that does not mean the polygon is not a good representation of the parent itself. A good example of this would be a Nike store inside of a shopping mall. If we don't have a good polygon for the Nike store, then the Nike store may share the same polygon as the mall, but the polygon for the mall is still representative of the Mall's shape and size. For more details on parent-child relationships, see the Spatial Hierarchy section above.
  • If you need to differentiate unique stores within a shared polygon, you should use the POI centroids (the latitude and longitude columns). Since user GPS signals often drift inside of large structures, for use cases such as determining places visited by a user, we have found that user distance to centroid is a good substitute for distance to polygon.
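
One way to apply the distance-to-centroid idea is to pick, among the POIs sharing a polygon, the one whose centroid is closest to a point. The sketch below uses the standard haversine great-circle formula with a mean Earth radius of 6,371 km; the function names, POI names, and coordinates are made up for illustration, and this is not SafeGraph's visit attribution model.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_poi(lat, lon, pois):
    """Return the POI whose centroid is closest to the given point."""
    return min(pois, key=lambda p: haversine_m(lat, lon,
                                               p["latitude"], p["longitude"]))


# Two made-up POIs that share one polygon.
shared = [
    {"name": "store a", "latitude": 37.7800, "longitude": -122.4000},
    {"name": "store b", "latitude": 37.7805, "longitude": -122.4010},
]
closest = nearest_poi(37.7804, -122.4009, shared)
```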

includes_parking_lot

  • In some cases, our polygons intentionally include the parking lot (e.g., car dealerships and gas stations). The purpose of the includes_parking_lot column is to make explicit to our customers whether the polygon_wkt includes the parking lot. There are three possible values: true, false, and null (null when we are not sure whether a parking lot is included in the geometry).

is_synthetic

  • We strive for precise polygons for nearly all of our places, but in some cases, we have not yet sourced an accurate polygon and will instead infer a synthetic polygon from an accurate centroid, category-based radius, and heuristics like avoiding overlap with roads. In these cases, is_synthetic = "true." For some place categories, it does not make sense to provide a precise polygon. Those categories are listed below:
    • Cemeteries and Crematories (812220)

building_height

  • This is the height above ground of the building in meters. Most POIs have a building polygon, but in some cases, a POI's polygon does not reflect a building (like a golf course, park, shopping mall, etc.), and building_height will always be null in these cases.

enclosed

  • Within spatial hierarchy, we are particularly interested in determining when a child POI is completely enclosed indoors by its parent. As a general guideline, enclosed = “TRUE” if you must enter the parent POI’s structure in order to arrive at the child POI.

  • The enclosed column serves three key functions:

1.) Provides a deeper classification of spatial hierarchy relationships.

2.) Informs our visits algorithm when to exclude a POI from Patterns. Within major indoor structures, a POI’s true footprint can be hard to discern, and the horizontal accuracy of GPS data deteriorates dramatically. For these reasons, we are reluctant to assign visits to enclosed POIs and instead roll the visits up to the parent POI (see here for more on visits to parent POIs). We have always tracked enclosed POIs internally, and we are externalizing this concept for complete transparency around when to expect visits at a given POI. The children of the following parent POIs are currently set to enclosed = “TRUE”:

  • Hotels (naics_code = 721110 OR 721120) when the child POI is a restaurant or bar (naics_code = 722514 OR 722515 OR 722513 OR 722511)
  • Stadium/Arena (naics_code = 713910 OR 711310)
  • Large medical facilities (naics_code = 621498 OR 622110 OR 621410 OR 621493 OR 621492)
  • Airports and airport terminals (naics_code = 488119), unless the child is an airport terminal POI
  • Indoor shopping malls (naics_code = 531120). To be clear, children of open air, outdoor shopping centers will have enclosed = “FALSE.”

Other than naics_code relationships, we track enclosed through known brand relationships where Brand A exists completely within Brand B. A canonical example: Brand A = Subway (SG_BRAND_04a8ca7bf49e7ecb4a32451676e929f0) and Brand B = Walmart (SG_BRAND_de80593878cb1673c62a7f338dc7e4e1). If a Walmart and Subway are co-located, the parent_safegraph_place_id for Subway = the Walmart’s safegraph_place_id, and enclosed = “TRUE” for Subway.

3.) Tells us where we should strive for polygon_class = “OWNED_POLYGON.” If a POI has enclosed = “FALSE,” then there are reasonable means to build a polygon that represents the unique footprint of the POI. This is not always true - but generally true.

Patterns

raw_visit_counts

This is the aggregated raw count of visits to the POI that we observe from our panel of mobile devices.

These values should be taken in the context of specific nuances & biases in our dataset:

Geographic Bias

  • Small geographic bias exists in our panel based on our understanding of the home locations of the devices in the panel.
  • SafeGraph tested for geographic bias by comparing its determination of the state-by-state numbers of home location of the devices in the panel to the true proportions reported by the 2016 US Census.
  • Based on that analysis, SafeGraph panel density closely mirrors true population density. The overall average percentage point difference is less than 1%, with a maximum of +/-3% per state.
  • For a deep dive on geographic bias in the panel, see Quantifying Sampling Bias in SafeGraph Patterns.

Panel Growth

  • The panel has grown significantly since its inception. As such, it is important to normalize the data when doing time series analysis across long periods of time or multiple releases.
  • We have seen success by normalizing visits by the total number of visits in the SafeGraph Panel, month by month. It is also worth exploring normalizing based on state or census block group. With each delivery, we provide you with the Panel Overview Data files to enable you to do these calculations.
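
The month-by-month normalization described above amounts to dividing a POI's visits by the panel-wide visit total for the same month. The sketch below uses made-up numbers and a hypothetical function name; in practice the panel totals would come from the Panel Overview Data files, whose exact column names are not assumed here.

```python
def normalize_visits(poi_visits_by_month, panel_visits_by_month):
    """Express each month's POI visits as a share of total panel visits."""
    return {
        month: visits / panel_visits_by_month[month]
        for month, visits in poi_visits_by_month.items()
    }


# Made-up figures: raw visits rise, but the panel doubled over the same period.
poi_visits = {"2019-01": 100, "2020-01": 150}
panel_visits = {"2019-01": 1_000_000, "2020-01": 2_000_000}
normalized = normalize_visits(poi_visits, panel_visits)
```

In this toy example the raw count grows by 50%, yet the normalized share falls, which is the kind of panel-growth effect the normalization is meant to remove.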

Predicting Financial Indicators

SafeGraph data can be used to estimate foot traffic and predict financial indicators of companies (number of visitors, revenue, etc.). Please see our Data Science Resources on Normalization (https://docs.safegraph.com/docs/data-science-resources#section-panel-normalization-for-longitudinal-analysis-sampling-bias-corrections-and-extrapolation) and our Normalization White Paper: How to Use SafeGraph Visits Data to Predict Company Reported KPIs.

Correlation between reported company KPIs and SafeGraph visits will vary depending on multiple factors related to the company:

  • Does the business separately report online vs in-store sales and revenue?
  • How much do online sales contribute to the overall revenue?
  • How much revenue is generated outside of the USA (SafeGraph Visits are US only)?
  • What is the ground truth correlation between foot traffic and sales for that business? For example, the relationship between foot traffic and sales at a car dealership has a very different pattern than at a convenience store.

Visits to Dense Urban Areas

  • Visits to urban, suburban, and rural areas have varying precision levels. It is more difficult to measure visits to a midtown Manhattan Starbucks than a visit to a suburban standalone Starbucks.

Visits to POIs within Large Structures/Indoor Malls

  • We attribute visits only to the parent POI when we determine that the parent completely encloses its children indoors. We believe this is the most accurate option given the limitations of GPS inside such structures and currently do this for the following parent POI types:

    • Airports
    • Indoor Malls (not including outdoor, open air malls)
    • Major Medical Facilities
    • Hotels
    • Casinos

Visits to Parent POIs

  • We attribute visits to both the parent POI and its children if the parent does not contain the children indoors. Therefore, if you count visits at the parent level and then again for all children, you are double counting visits.

  • In other words, raw_visit_counts at a parent POI = SUM(raw_visit_counts) at all child POIs + other visits that the parent picks up itself (not through any of its children). For example, a shopping center POI may have raw_visit_counts = SUM(raw_visit_counts) for all of its children because there are not any gaps within the shopping center to visit without the presence of a child POI. On the other hand, a parent POI, like a golf course, will likely have more visits than the sum of its children because there are plenty of places within the golf course to visit without making a visit to a child POI (ex: playing 18 holes on the green but not making a visit to the club house or restaurant).
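
The relationship above suggests two simple calculations: the visits a parent picks up on its own, and a total that avoids double counting. The sketch below assumes, as described above, that every child's visits are also counted at its parent; the rows, IDs, and function names are made up for illustration.

```python
# Made-up rows: a golf course with two children, plus an unrelated standalone POI.
rows = [
    {"safegraph_place_id": "sg:golf", "parent_safegraph_place_id": None,
     "raw_visit_counts": 500},
    {"safegraph_place_id": "sg:clubhouse", "parent_safegraph_place_id": "sg:golf",
     "raw_visit_counts": 120},
    {"safegraph_place_id": "sg:restaurant", "parent_safegraph_place_id": "sg:golf",
     "raw_visit_counts": 80},
    {"safegraph_place_id": "sg:cafe", "parent_safegraph_place_id": None,
     "raw_visit_counts": 40},
]


def parent_only_visits(parent_id, rows):
    """Visits the parent picks up itself, not through any direct child."""
    parent = next(r for r in rows if r["safegraph_place_id"] == parent_id)
    from_children = sum(r["raw_visit_counts"] for r in rows
                        if r["parent_safegraph_place_id"] == parent_id)
    return parent["raw_visit_counts"] - from_children


def total_without_double_counting(rows):
    """Sum visits over POIs without a parent, since parents include children."""
    return sum(r["raw_visit_counts"] for r in rows
               if r["parent_safegraph_place_id"] is None)
```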

Visits to Strip Malls

  • We attribute visits to the individual stores as well as the parent strip mall (assuming we have a POI for the entire strip mall). There will be instances where we have not divided a strip mall polygon into its constituent stores. Our model to determine visits does take a number of factors into account, including distance from centroid, so even though there are multiple POI in one strip mall polygon, we attempt to allocate visits within the strip mall to the POI most likely to have received the visit.

Worker & Non-Worker Visits

  • Prior to our May 2020 release, we attempted to exclude workers at a POI from our visit counts to the POI. However, we were only able to determine a limited number of workers so we decided to remove this filter. The best way to determine how many of the visitors to a POI are workers is by looking at the bucketed_dwell_time column.

GPS Data

  • The visits are determined using GPS data.
  • We do not include any GPS data with a horizontal accuracy greater than 160 meters.

Very long visits

  • Sometimes a visit lasts a very long time (e.g., > 24 hours). Please see our documentation on how Patterns handles Long Visits for a number of nuances about these edge scenarios.
  • If you are seeing visits to a POI that are longer than expected, this is likely due to picking up an employee device or picking up someone in a place above a POI (such as residential or retail over office).

🚨Artifacts or Known Data Issues in the Data 🚨

  • We have noted that the visit counts on 4/21/2019 (Easter) are lower than expected. We had a supply issue at that time that seemed to have decreased the number of visits we saw.
  • See Known Data Issues or Artifacts

raw_visitor_counts

  • These are the aggregated raw counts of visitors.

visits_by_day

  • This is an array of visits on each day in the month.
  • We are breaking up days based on local time.
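
As a rough sketch of how this column might be consumed, the snippet below pairs each array entry with its calendar date. It assumes the cell is a JSON-encoded array and that date_range_start is an ISO 8601 timestamp whose first ten characters are the local start date; both are assumptions, and the helper name and sample values are hypothetical.

```python
import json
from datetime import date, timedelta

def visits_by_date(date_range_start, visits_by_day):
    """Pair each entry of the visits_by_day array with its local calendar date."""
    counts = json.loads(visits_by_day)
    # Assumption: the first 10 characters are the local start date (YYYY-MM-DD).
    start = date.fromisoformat(date_range_start[:10])
    return {start + timedelta(days=i): n for i, n in enumerate(counts)}

# Hypothetical three-day excerpt of a row.
daily = visits_by_date("2020-06-01T00:00:00-05:00", "[3, 0, 5]")
print(daily[date(2020, 6, 3)])  # 5
```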

poi_cbg

  • This is the census block group that the POI resides within as determined using the centroid of the POI.

visitor_home_cbgs

  • These are the home census block groups of the visitors to the POI.
  • For each census block group, we show the number of associated visitors (as opposed to the number of visits).
  • We do not have a home census block group for every visitor, and not every visitor originates from the U.S. The number of U.S. visitors listed in the visitor_country_of_origin column represents the total number of visitors which we have determined originate from the U.S.
  • We apply differential privacy to the counts. We do not include a census block group unless there are at least 2 visitors from that census block group. If there are between 2 and 4 visitors from a census block group, we show this as 4.
  • See our home algorithm documentation on how we estimate home location. We determine the home census block group by analyzing 6 weeks of data during nighttime hours (between 6 pm and 7 am). We require a sufficient amount of evidence (total data points and distinct days) to assign a home (common nighttime) location for the device.
  • The census block group is the highest geographic resolution for which the US Census provides demographic information. This demographic data is publicly available through APIs maintained by the US Census. SafeGraph provides census block group demographic data to download for free. There are also resources for developers on GitHub and Stack Overflow for working with the US Census APIs. Some of the most common APIs are the Population Estimates API and the Decennial Census API.
  • See also: How do I work with Patterns columns that contain JSON
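
A minimal sketch of reading this column, assuming each cell holds a JSON object that maps a census block group ID to a visitor count. The helper name and the sample IDs are hypothetical.

```python
import json

def home_cbg_counts(visitor_home_cbgs):
    """Parse a cell into {census block group ID: visitor count}."""
    # Keep the IDs as strings so that leading zeros in FIPS codes survive.
    return {cbg: int(n) for cbg, n in json.loads(visitor_home_cbgs).items()}

cell = '{"010010201001": 4, "010010201002": 12}'  # hypothetical cell value
counts = home_cbg_counts(cell)

# A reported 4 means "between 2 and 4 visitors", so treat it as an upper bound.
upper_bounds = [cbg for cbg, n in counts.items() if n == 4]
print(upper_bounds)  # ['010010201001']
```

Keeping the IDs as strings rather than integers matters when joining to census tables, because state FIPS prefixes such as 01 would otherwise lose their leading zero.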

visitor_work_cbgs

  • These are the work census block groups of the visitors to the POI.
  • For each census block group, we show the number of associated visitors (as opposed to the number of visits).
  • We determine the work census block group of a device by looking at 1 month of data and determining where the device is most frequently located during traditional work hours, excluding weekends and overnight hours. It is easier to determine a home/common nighttime location of a device than a work census block group, so our data contains more home_cbgs than work_cbgs.
  • We apply differential privacy to the counts. We do not include a census block group unless there are at least 2 visitors with a work census block group in that census block group. If there are between 2 and 4 visitors with a work census block group in a census block group, we show this as 4.
  • See also: How do I work with Patterns columns that contain JSON

visitor_country_of_origin

  • These are the countries of origin of the visitors to the POI.
  • We determine the country of origin by analyzing 6 weeks of data during nighttime hours (between 6 pm and 7 am). We require a sufficient amount of evidence (total data points and distinct days) to assign a common nighttime location for the device. The country of this common nighttime location is the specified country of origin.
  • We apply differential privacy to the counts. We do not include a country unless there are at least 2 visitors from that country. If there are between 2 and 4 visitors from a country, we show this as 4.
  • See also: How do I work with Patterns columns that contain JSON

distance_from_home

  • This is the median distance from home to the POI in meters for the visitors for whom we have identified a home location.
  • If we have fewer than 5 visitors to a POI, the value will be null.
  • We do not adjust for visits -- each visitor is counted equally.

median_dwell

  • This is the median of the minimum dwell times we have calculated for each of the visits to the POI.
  • We determine the median dwell time by looking at the first and last ping we see from a device during a visit. This is a minimum dwell because it is possible the device was at the POI longer than the time of the last ping.
  • It is possible to have a minimum dwell of 0 if we only saw 1 ping and determined the visit based on factors such as wifi.
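
The calculation described above can be sketched as follows. The ping timestamps are hypothetical, and reporting the result in minutes is an assumption made for the example.

```python
from datetime import datetime
from statistics import median

def minimum_dwell_minutes(ping_times):
    """Minutes between the first and last ping of one visit (0 for a single ping)."""
    return (max(ping_times) - min(ping_times)).total_seconds() / 60

# Hypothetical pings for three visits to one POI.
visits = [
    [datetime(2020, 6, 1, 9, 0), datetime(2020, 6, 1, 9, 25)],    # 25 minutes
    [datetime(2020, 6, 1, 12, 0)],                                # single ping, 0 minutes
    [datetime(2020, 6, 2, 18, 5), datetime(2020, 6, 2, 18, 50)],  # 45 minutes
]
median_dwell = median(minimum_dwell_minutes(v) for v in visits)
print(median_dwell)  # 25.0
```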

bucketed_dwell_times

related_same_day_brand

These are the brands that the visitors to this POI visit (on the same day that they visit the POI) in higher numbers than the general members of our panel. The number mapped to each brand is an indicator of how highly correlated a POI is to a certain brand beyond what we are seeing generally in the panel. For example, if a lot of visitors to Starbucks at 123 Main Street also tend to visit an unpopular brand on the same day, the number could be quite high (e.g., > 50) whereas if the same number of 123 Main Street Starbucks visitors also visit Targets on the same day, the number will be lower because Target is a popular brand.

See also: How do I work with Patterns columns that contain JSON

If you want to know the nitty-gritty of how we calculate this index, read on at your own risk:

  • For each day in the month, we find the total number of visitors who went to both the POI and another branded location. For each brand, we divide this number by the total number of visitors to the POI. This gives us our "POI Specific Brand Ratios" for each brand for each day in the month.
  • For each day in the month, we find the number of visitors who went to each brand divided by the total number of visitors in the Panel. This gives us our "Baseline Brand Ratios" for each brand for each day in the month.
  • For each brand, we take the POI Specific Brand Ratio for each day of the month and subtract from it the corresponding Baseline Brand Ratio (the "Daily Percentage"). We then take the median of the differences. If the result is greater than or equal to 5%, we include the brand in the list.
  • In determining the median we exclude any POI Specific Brand Ratios that are 0.
  • Note that the final number is rounded so it is possible to have 100 (likely because the applicable Baseline Brand Ratio is less than 0.5%).

For example, if on the first of the month, 20 visitors out of 100 that went to a certain SoulCycle POI also went to a Sephora while in the Panel generally, only 2 out of 100 visitors went to a Sephora, the Daily Percentage would be 18% (20/100 - 2/100). This 18% would be included with the other Daily Percentages for the month to determine the median of those numbers.
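
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not SafeGraph's implementation: the function name and input layout are hypothetical, and expressing the final value as a rounded whole-number percentage is an assumption based on the rounding note above.

```python
from statistics import median

def same_day_brand_index(daily, threshold=0.05):
    """daily: one tuple per day of
    (visitors to both POI and brand, visitors to POI,
     visitors to brand in panel, visitors in panel).
    Returns the rounded index in percent, or None if the brand is not listed."""
    daily_percentages = []
    for both, poi, brand, panel in daily:
        if poi == 0 or both == 0:
            continue  # days with a POI Specific Brand Ratio of 0 are excluded
        daily_percentages.append(both / poi - brand / panel)
    if not daily_percentages:
        return None
    score = median(daily_percentages)
    return round(score * 100) if score >= threshold else None

# The SoulCycle/Sephora day from the example, plus two hypothetical days.
days = [(20, 100, 2, 100), (10, 100, 2, 100), (0, 50, 2, 100)]
print(same_day_brand_index(days))  # 13
```

Here the third day is dropped because its POI Specific Brand Ratio is 0, leaving Daily Percentages of 18% and 8%, whose median is 13%.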

related_same_month_brand

These are the brands that the visitors to this POI visit in higher numbers than the general members of our panel over the course of the month. The number mapped to each brand is an indicator of how highly correlated a POI is to a certain brand beyond what we are seeing generally in the panel. For example, if visitors to Starbucks at 123 Main Street also tend to visit an unpopular brand a lot, the number could be quite high (e.g., > 50) whereas if the same number of 123 Main Street Starbucks visitors also visit Targets a lot, the number will be lower because Target is a popular brand.

See also: How do I work with Patterns columns that contain JSON

If you want to know the nitty-gritty of how we calculate this index, read on at your own risk:

  • For the entire month, we find the total number of visitors who went to both the POI and another branded location. For each brand, we divide this number by the total number of visitors to the POI. This gives us our "POI Specific Brand Ratios" for each brand.
  • For the entire month, we find the number of visitors who went to each brand divided by the total number of visitors in the Panel. This gives us our "Baseline Brand Ratios" for each brand.
  • For each brand, we take the POI Specific Brand Ratio and subtract from it the corresponding Baseline Brand Ratio. If the result is greater than or equal to 5%, we include the brand in the list.
  • Note that the final number is rounded so it is possible to have 100 (likely because the applicable Baseline Brand Ratio is less than 0.5%).

For example, if for the entire month 20 visitors out of 100 that went to a certain SoulCycle POI also went to a Sephora while in the Panel generally, only 2 out of 100 visitors went to a Sephora, the percentage would be 18% (20/100 - 2/100).

popularity_by_hour

  • This is an array of visits seen in each hour of the day over the course of the month.
  • Local time is used.
  • If a visitor stays for multiple hours, an item in the array will be incremented for each hour during which the visitor stayed. This means that if you sum the numbers in the popularity_by_hour array the sum will likely be greater than the amount shown in the raw_visit_counts column (since the raw_visit_counts counts a multiple hour visit as one visit).
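
A small sketch of the counting rule above, assuming each visit is reduced to a local start and end time. The function name and the sample visits are hypothetical.

```python
from datetime import datetime, timedelta

def popularity_by_hour(visits):
    """visits: list of (start, end) local datetimes. Returns 24 hourly counts."""
    hours = [0] * 24
    for start, end in visits:
        # Walk hour by hour from the hour containing the start of the visit.
        t = start.replace(minute=0, second=0, microsecond=0)
        while t <= end:
            hours[t.hour] += 1
            t += timedelta(hours=1)
    return hours

visits = [
    (datetime(2020, 6, 1, 9, 40), datetime(2020, 6, 1, 11, 10)),  # spans hours 9, 10, 11
    (datetime(2020, 6, 1, 10, 5), datetime(2020, 6, 1, 10, 20)),  # hour 10 only
]
by_hour = popularity_by_hour(visits)
print(by_hour[9:12], sum(by_hour))  # [1, 2, 1] 4
```

Two visits produce a sum of 4 across the array, which is why the array total usually exceeds raw_visit_counts.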

popularity_by_day

device_type

  • This shows how many visitors to the POI use Android vs. iOS.
  • We apply differential privacy to the counts. We do not include a device type unless there are at least 2 visitors with that device type. If there are between 2 and 4 visitors with that device type, we show this as 4.
  • See also: How do I work with Patterns columns that contain JSON

carrier_name

  • This is a premium column that maps wireless carrier names to the number of visitors to the POI whose device uses that wireless carrier. Only carrier_names with at least 2 devices are shown, and carrier_names with fewer than 5 devices are reported as 4 per our differential privacy practices. Below is a breakdown of our panel of devices by wireless carrier as of the July 2020 release:

| Carrier | Count | Ratio |
| --- | --- | --- |
| T-Mobile | 7,129,894 | 24.66% |
| C-Spire | 204,800 | 0.71% |
| Altice | 323,221 | 1.11% |
| Sprint | 3,685,988 | 12.75% |
| AT&T | 7,267,146 | 25.13% |
| Verizon | 10,303,871 | 35.64% |

Known Data Issues or Artifacts

We strive for full transparency for any known data issues that may affect your analysis, not otherwise accounted for in monthly release notes. If you notice a problem that is not listed here, please send us your observations so we can investigate.

| Date Reported | Description | Discussion/Links | Resolved? |
| --- | --- | --- | --- |
| 3/1/2020 | 2/25/2020 Artifact (affecting SDM and Patterns) | | |
| 4/2/2020 | Problem with IOWA CBG 190570010001 | | Yes. This CBG has been removed from SDM. |
| 4/6/2020 | Duplicate CBGs with Different States | | Yes. Ignore State in home-panel-summary and aggregate within CBG. Product fix coming soon. |
| 4/13/2020 | CBG FIPS are corrupted for some rows in Open Census Data file cbg_b22.csv | | Unfortunately, there is no timeline for fixing this. Apologies for the inconvenience. However, Consortium members can see Jonas Peeters' solution. |
| 6/30/2020 | opened_on column over-indexed on 2020-01 | | |
| 7/7/2020 | Several inexplicably abnormal days of data in 2018. Dates affected: 3/15/2018, 9/15/2018, 9/16/2018 | | No fix in medium term. Short term workaround is to omit completely if possible. Otherwise, replace with median imputation or some other method so the days have no impact on analysis. |
| 8/30/2020 | 4/21/2019 (Easter) may be an anomalous day in Patterns data. | We had a supply issue at that time that seemed to have artificially decreased the number of visits collected. | Actively investigating. Workaround is to ignore data from this day. |
| 11/18/2020 | In Social Distancing Metrics (and possibly other datasets) there are an abnormal number of records showing travel to/from parts of Kansas. This is likely due to a GPS data problem related to the "center of the country" issue known to influence a very small minority of location data when non-GPS data is inadvertently mixed with GPS data. | | SafeGraph is always working to ensure the highest quality location data is used to build its products and we are always working to improve artifacts like this one. |

Privacy

To preserve privacy, we apply differential privacy techniques to the following columns: visitor_home_cbgs, visitor_daytime_cbgs, visitor_work_cbgs, visitor_country_of_origin, device_type, carrier_name.
We have added Laplacian noise to the values in these columns. After adding noise, only attributes (e.g., a census block group) with at least two devices are included in the data.
For many columns we do not report data unless at least 2 visitors are observed from that group. If there are between 2 and 4 visitors, this is reported as 4 (e.g., see visitor_home_cbgs).
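
The reporting rule above can be sketched as follows. This is an illustration only: the noise scale, the rounding of noisy values, and all function names are assumptions, since the actual parameters are not stated here.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def reported_count(noisy_count):
    """Drop groups below 2, report 2 to 4 as 4, and pass larger counts through."""
    if noisy_count < 2:
        return None
    return max(noisy_count, 4)

def privatize(counts, scale=1.0, rng=random):
    """Add noise to each count (assumed scale), then apply the reporting rule."""
    out = {}
    for key, n in counts.items():
        value = reported_count(round(n + laplace_noise(scale, rng)))
        if value is not None:
            out[key] = value
    return out

print([reported_count(n) for n in (1, 2, 3, 4, 7)])  # [None, 4, 4, 4, 7]
```

Because small counts are either dropped or lifted to 4, any downstream sum over these columns should be read as approximate.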

SafeGraph Common Nighttime Location Algorithm

Many columns in SafeGraph datasets rely on estimates of device “home location” (e.g., at the level of a census block group). “Home location” is an abbreviation for “common nighttime location”. For Monthly and Weekly Patterns these columns are visitor_home_cbgs, distance_from_home, visitor_country_of_origin, and the home_panel_summary.csv and normalization_stats.csv. For Neighborhood Patterns these columns are device_home_areas, and all of the *_device_home_areas columns as well as distance_from_home and the home_panel_summary.csv and normalization_stats.csv. For Social Distancing Metrics all the data reported rely on an estimate of the “home location” for a device, listed in the column origin_census_block_group.

How does SafeGraph estimate home location?

SafeGraph uses historical data to estimate a common nighttime location (e.g., a census block group) over a 6 week window, for each device. As of July 2020, SafeGraph uses the Home Algo v2 “Incremental Updates” (see below). Previously a different algorithm was used (Home Algo v1 “Monthly Batched”), and this algorithm is also documented below for insight into the rationale and product evolution at SafeGraph.

Home Algo v1 “Monthly Batched”

  • The Home Algo v1 was used in production for data generation prior to May 2020 and is now retired.
  • The Home Algo was run 1x / month on the 1st of each month.
  • At the start of each month, all pings for the previous 6 weeks were analyzed for each device. These pings were aggregated into clusters, and then filtered to only include clusters during nighttime hours (6pm - 7am local time). We identified the most common nighttime location for each device based on the frequency of clusters. The winning location census block group (CBG) was reported as the “home” for that device for the subsequent month.
    • We also recorded the number of unique hours and unique days for which pings were observed at the common nighttime location, and these numbers were used to form a “confidence score”. Only devices with a confidence score > a threshold were considered high confidence home locations, and only high confidence home locations were used in SafeGraph products.
    • Devices that did not have a high confidence home location were treated as if the home location was unknown.
  • Known Issues with Home Algo v1:
    • New devices entering the panel were not assigned a home until the first of the following month, and therefore all new devices across a month were added to the panel simultaneously. This created discontinuities at month boundaries.
    • Devices leaving the panel across the month caused the apparent sample size to slowly decay across the course of each month, and then regenerate at the start of the month suddenly.

Home Algo v2 “Incremental Updates”

  • Home Algo v2 went into production in May 2020 (see Rollout of Home Algo v2 below).
  • Each day:
    • All pings for all devices are clustered, filtered to only include clusters during nighttime hours (6pm - 7am local time), and the frequency of clusters per unit space (e.g., 3 clusters in census block group A, 4 clusters in census block group B, etc.) is computed. The census block group with the most clusters is internally recorded as the daily “winner”. We also record how many pings are observed in all of the clusters at the winning location.
    • Then, any device that has not had a home location updated within 30 calendar days is “updated” by re-computing the common nighttime location (see next).
  • To compute a common nighttime location:
    • Look back over the previous 6 weeks of daily “winning” common nighttime locations, and identify the most frequently “winning” common nighttime location. This is the new home location for the next 30 calendar days.
    • We also compute an internal “confidence score” based on the number of unique hours and unique days for which pings were observed at this home location.
    • Only devices with a confidence score > a threshold are considered high confidence home locations.
    • The new home location is recorded internally, along with its confidence rating, along with the date.
    • The new home location (or lack thereof) immediately takes effect for that device.
    • The home location for this device will not be re-computed for 30 calendar days.
    • Note: Only high confidence home locations are used in SafeGraph Patterns and Social Distancing Metrics products and reflected in the home_panel_summary.csv. Devices that do not have a high confidence home location are treated as if the home location is unknown.
  • New Devices. When a new device enters the panel, there is no historical data. A new device must accumulate at least 5 unique days of data (this may be > 5 calendar days if the device does not generate pings every day) before it is eligible to determine a high-confidence home. After 5 unique days of data are collected the home location will be computed, and it will not be recomputed for another 30 calendar days.
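
The daily and 30-day steps above can be sketched as follows. This is a simplified illustration, not SafeGraph's implementation: clustering is taken as given, the confidence score is omitted because its formula and threshold are not stated, the 5-unique-day rule is applied to every device for brevity, and all names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

LOOKBACK = timedelta(weeks=6)   # window of daily winners to consider
REFRESH = timedelta(days=30)    # a home is not re-computed more often than this
MIN_UNIQUE_DAYS = 5             # days of data needed before a home is eligible

def daily_winner(nighttime_cluster_cbgs):
    """CBG with the most nighttime clusters on one day, or None if there were none."""
    counts = Counter(nighttime_cluster_cbgs)
    return counts.most_common(1)[0][0] if counts else None

def home_location(daily_winners, today):
    """Most frequent daily winner over the previous 6 weeks.
    daily_winners maps a date to that day's winning CBG."""
    recent = [cbg for day, cbg in daily_winners.items()
              if today - LOOKBACK <= day < today and cbg is not None]
    if len(recent) < MIN_UNIQUE_DAYS:
        return None
    return Counter(recent).most_common(1)[0][0]

def needs_update(last_updated, today):
    """True if the device has no home yet or its home is at least 30 days old."""
    return last_updated is None or today - last_updated >= REFRESH

# Hypothetical device: 10 days of winners, 6 in CBG "A" and 4 in CBG "B".
winners = {date(2020, 6, 1) + timedelta(days=i): ("A" if i % 3 else "B")
           for i in range(10)}
print(home_location(winners, date(2020, 6, 11)))  # A
```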

Rollout of Home Algo v2

Home Algo v1 was used until May 2020. Forward-facing data generation switched over to use Home Algo v2 on the following dates:

  • SDM v2.1: May 18th 2020. (Note SDM v2.1 began on 5/10/20 using Home Algo v1)
  • Monthly Places Patterns: May 2020 data (released in June 2020).
  • Weekly Places Patterns: Week of 5/04/20

Historical Data and Backfills

For historical backfills of data before May 2020, a one-time hybrid algorithm was used, rather than back-computing Home Algo v2.

The Hybrid Home Algo for Historical Backfills was applied to the following backfills:

  • The May 2020 Backfill of Weekly Places Patterns (Jan 1 2020 through May 2020)
  • The May 2020 Backfill of Monthly Patterns (Jan 1 2018 through May 2020)
  • The May 2020 Backfill of Social Distancing Metrics v2.1 (Jan 1 2019 - Dec 31 2019)

Why did SafeGraph not use Home Algo v2 for historical data?

  • Backfilling Home Algo v2 on the history of SafeGraph data had non-trivial compute costs.
  • The main shortcoming of Home Algo v1 is that it failed to identify the home locations of new devices in a timely manner, and added all new devices to the panel at the start of each month.
  • By retrospectively using data about a device “from the future” (e.g., 30 days following the first of the month), we are able to utilize the home location of a device as soon as it appears in the panel, alleviating the main problem with Home Algo v1.
  • However, during forward-facing data generation day to day and month to month, we cannot use data “from the future”, and decided that Home Algo v2 was the best solution for forward-facing data generation.

Hybrid Home Algo for Historical Backfills

  • Retrospectively, the home location for each device each month, N, was inferred by analyzing ~3 months of home data (based on Home Algo v1) for that device looking back two months (N-1 and N-2) as well as looking for all data from the current month (N). Any home location detected (using the Home Algo v1) during any of those three months (N-2, N-1, N) was recorded as the home location of that device for month N by the Hybrid Home Algo.
  • If the home locations across those three months conflicted, then the chronologically most recent high-confidence home location was assigned as the home location for month N.
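
The selection rule in the two bullets above can be sketched as follows, assuming the Home Algo v1 results are already available per month. The function name and the month indexing are hypothetical.

```python
def hybrid_home(v1_homes_by_month, n):
    """v1_homes_by_month maps a month index to that month's high-confidence
    Home Algo v1 home (or None). Returns the most recent home found in
    months n, n-1, n-2, or None if none of them has one."""
    for month in (n, n - 1, n - 2):
        home = v1_homes_by_month.get(month)
        if home is not None:
            return home
    return None

# Hypothetical device: homes detected in months 3 and 4, nothing in month 5.
print(hybrid_home({3: "A", 4: "B", 5: None}, 5))  # B
```

Scanning from the most recent month backwards covers both cases at once: a single detected home is returned as is, and conflicting homes resolve to the chronologically latest one.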

During testing, internal SafeGraph analysis showed that the Hybrid Home Algo produced very similar results to a backfill of Home Algo v2 (on a limited timeline). Although they are not identical methodologies, we expect the data to be highly comparable and we don’t expect major discontinuities at the switchover between backfill and forward-facing data generation.

Column Ordering

Files are delivered in a joined format when more than one product is purchased, and the exact column ordering depends on the product combination. If column order matters to you, take heed and reference the breakdown of column orders below:
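
One way to avoid depending on column position is to look columns up by header name. The sketch below uses Python's standard csv module; the helper name and the sample row are hypothetical, and it assumes the file has a header row.

```python
import csv
import io

def read_columns(csv_file, wanted):
    """Yield dicts holding only the wanted columns, looked up by header name."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in wanted if c not in reader.fieldnames]
    if missing:
        raise KeyError(f"columns not in this delivery: {missing}")
    for row in reader:
        yield {c: row[c] for c in wanted}

# Hypothetical one-row file; real deliveries contain many more columns.
sample = io.StringIO(
    "placekey,location_name,raw_visit_counts\n"
    "abc@123,Example Cafe,42\n"
)
rows = list(read_columns(sample, ["raw_visit_counts", "placekey"]))
print(rows)  # [{'raw_visit_counts': '42', 'placekey': 'abc@123'}]
```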

*only pertinent to files including closed POIs

†carrier_name is a premium column. Please Contact Sales for more details.

Core+Geometry+Patterns

[core_poi-geometry-patterns.csv]

placekey

safegraph_place_id

parent_safegraph_place_id

safegraph_brand_ids

location_name

brands

top_category

sub_category

naics_code

latitude

longitude

street_address

city

region

postal_code

open_hours

category_tags

*opened_on

*closed_on

*tracking_opened_since

*tracking_closed_since

polygon_wkt

polygon_class

building_height

enclosed

phone_number

is_synthetic

includes_parking_lot

iso_country_code

date_range_start

date_range_end

raw_visit_counts

raw_visitor_counts

visits_by_day

poi_cbg

visitor_home_cbgs

visitor_daytime_cbgs

visitor_work_cbgs

visitor_country_of_origin

distance_from_home

median_dwell

bucketed_dwell_times

related_same_day_brand

related_same_month_brand

popularity_by_hour

popularity_by_day

device_type

†carrier_name

Core+Geometry

[core_poi-geometry.csv]

placekey

safegraph_place_id

parent_safegraph_place_id

safegraph_brand_ids

location_name

brands

top_category

sub_category

naics_code

latitude

longitude

street_address

city

region

postal_code

open_hours

category_tags

*opened_on

*closed_on

*tracking_opened_since

*tracking_closed_since

polygon_wkt

polygon_class

building_height

enclosed

phone_number

is_synthetic

includes_parking_lot

iso_country_code

Core+Patterns

[core_poi-patterns.csv]

placekey

safegraph_place_id

parent_safegraph_place_id

location_name

safegraph_brand_ids

brands

top_category

sub_category

category_tags

naics_code

latitude

longitude

street_address

city

region

postal_code

iso_country_code

phone_number

open_hours

*opened_on

*closed_on

*tracking_opened_since

*tracking_closed_since

date_range_start

date_range_end

raw_visit_counts

raw_visitor_counts

visits_by_day

poi_cbg

visitor_home_cbgs

visitor_daytime_cbgs

visitor_work_cbgs

visitor_country_of_origin

distance_from_home

median_dwell

bucketed_dwell_times

related_same_day_brand

related_same_month_brand

popularity_by_hour

popularity_by_day

device_type

†carrier_name

Geometry+Patterns

[geometry-patterns.csv]

placekey

safegraph_place_id

parent_safegraph_place_id

location_name

safegraph_brand_ids

brands

latitude

longitude

street_address

city

region

postal_code

iso_country_code

polygon_wkt

polygon_class

building_height

enclosed

includes_parking_lot

is_synthetic

date_range_start

date_range_end

raw_visit_counts

raw_visitor_counts

visits_by_day

poi_cbg

visitor_home_cbgs

visitor_daytime_cbgs

visitor_work_cbgs

visitor_country_of_origin

distance_from_home

median_dwell

bucketed_dwell_times

related_same_day_brand

related_same_month_brand

popularity_by_hour

popularity_by_day

device_type

†carrier_name

Delivery Cadence and Directory Structure
