
SafeGraph Open Census Dataset

Introduction

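This post looks at the Open Census Data that SafeGraph Inc. repackages and distributes publicly. The dataset is essentially a consolidated, cleaned-up collection of the demographic data that the US Census Bureau already publishes. Besides attributes such as population and income level by area, it also includes information such as counts of certain facilities.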


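The US Census Bureau is the agency that surveys and collects demographic information on a wide range of characteristics. The well-known national population count (the Census) is one of these demographic datasets. Because a census demands a large workforce, budget, and time, the United States carries out a full enumeration (full survey) only once every 10 years and publishes the results; this official result is known as the Decennial Census. Beyond that, the Bureau also surveys, for each area, sex (Gender), age (Age), income (Income), ethnicity (Ethnicity; Latin American (Hispanic), Asian, African, or European), and so on. This survey runs under the project name American Community Survey (ACS), and its results are compiled from a sample survey and published every year.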

SafeGraph - Open Census Data

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gdown
from tqdm import tqdm
import tarfile
import geopandas as gpd


Data Acquisition

The Google Drive file IDs and file names of the currently distributed datasets are listed below.

# File ID and Name
IDnFN = [
    ['1klKXB35iXyhfbgKTEXZXgdhZWJDwpbEi', 'safegraph_open_census_data_2020.tar.gz'],
    ['1v2MTZG9MNW-ao8fSeO6r69AfHsXB2mwx', 'safegraph_open_census_data_2020_to_2029_geometry.tar.gz'],
    ['1rMF7doWkgoKvAs4GPi5FpQNhOFd9V9b2', 'safegraph_open_census_data_2020_redistricting.tar.gz'],
    ['1ab-dGzzDntCEE8wVBekAQpvsOZvxSbVV', 'safegraph_open_census_data_2019.tar.gz'],
    ['1oUT_UBUCa6nRZ207taXHAeANmpHEwsGt', 'safegraph_open_census_data_2018.tar.gz'],
    ['15TFKFONZquET0AvpFlsENSOP2dk2w39V', 'safegraph_open_census_data_2017.tar.gz'],
    ['10InSSafTPUZ6tK-e8g6msYepCO9i2H3L', 'safegraph_open_census_data_2016.tar.gz'],
    ['1QmKe7v7peaYAjDDh50hNP4b9s0B0JTWm', 'safegraph_open_census_data_2010_to_2019_geometry.tar.gz'],
]
gdrive_base_path = 'https://drive.google.com/uc?id='
SavePath = '/open_census_data'
os.makedirs(SavePath, exist_ok=True)  # make sure the target directory exists before downloading
for file_id, file_name in tqdm(IDnFN):
    gdown.download(gdrive_base_path + file_id, os.path.join(SavePath, file_name), quiet=True)
100%|██████████| 8/8 [04:37<00:00, 34.67s/it]
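The download loop above fetches every archive on each run, which takes several minutes. A minimal sketch of how one might skip archives that are already on disk: `plan_downloads` is a hypothetical helper (not part of gdown) that only builds the (URL, destination) pairs and performs no network access itself.

```python
import os

GDRIVE_BASE = 'https://drive.google.com/uc?id='  # same base URL as in the loop above


def plan_downloads(id_name_pairs, save_dir):
    """Return (url, destination) pairs for archives not yet present in save_dir.

    Hypothetical helper for illustration; it does not download anything.
    """
    plan = []
    for file_id, file_name in id_name_pairs:
        dest = os.path.join(save_dir, file_name)
        if os.path.exists(dest):
            continue  # already downloaded, skip it
        plan.append((GDRIVE_BASE + file_id, dest))
    return plan
```

Each returned pair could then be passed to `gdown.download(url, dest, quiet=True)` exactly as in the loop above.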
# unzip 'safegraph_open_census_data_2020.tar.gz'
with tarfile.open(os.path.join(SavePath, IDnFN[0][1]), 'r:gz') as tr:
    tr.extractall(path=SavePath)

# unzip 'safegraph_open_census_data_2020_to_2029_geometry.tar.gz' to extract a geojson file of 'cbg_2020.geojson'
with tarfile.open(os.path.join(SavePath, IDnFN[1][1]), 'r:gz') as tr:
    tr.extractall(path=SavePath)
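Since some of these archives unpack to files of a gigabyte or more, it can be worth peeking at the member names before extracting. The sketch below uses only the standard `tarfile` module; `list_archive_members` is a hypothetical helper name introduced here for illustration.

```python
import tarfile


def list_archive_members(archive_path, limit=10):
    """Return up to `limit` member names of a .tar.gz archive without extracting it."""
    names = []
    with tarfile.open(archive_path, 'r:gz') as tr:
        for member in tr:  # iterating a TarFile yields TarInfo objects
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For example, `list_archive_members(os.path.join(SavePath, IDnFN[1][1]))` would show whether the geometry archive really contains the GeoJSON file you expect.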


US Census GeoJson

This is the polygon-style geometry GeoJSON file for the Census Block Group (CBG), the US Census Bureau's aggregation unit. The file is large (~1.9 GB), so loading it takes quite a while. Unless you need polygon-style visualization, use ‘/metadata/cbg_geographic_data.csv’ inside each census data package instead.

BasePath = '/open_census_data'

# os.listdir() returns entries in arbitrary order, so name the extracted GeoJSON explicitly.
cbg_geo = gpd.read_file(os.path.join(BasePath, 'cbg_2020.geojson'))
cbg_geo


| | StateFIPS | CountyFIPS | TractCode | BlockGroup | CensusBlockGroup | State | County | MTFCC | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01 | 033 | 020200 | 1 | 010330202001 | AL | Colbert County | G5030 | MULTIPOLYGON (((-87.70081 34.76189, -87.70081 ... |
| 1 | 01 | 019 | 956000 | 1 | 010199560001 | AL | Cherokee County | G5030 | MULTIPOLYGON (((-85.67917 34.15255, -85.67904 ... |
| 2 | 01 | 073 | 004701 | 2 | 010730047012 | AL | Jefferson County | G5030 | MULTIPOLYGON (((-86.78478 33.51157, -86.78267 ... |
| 3 | 01 | 073 | 004702 | 1 | 010730047021 | AL | Jefferson County | G5030 | MULTIPOLYGON (((-86.77400 33.51790, -86.77396 ... |
| 4 | 01 | 073 | 004702 | 2 | 010730047022 | AL | Jefferson County | G5030 | MULTIPOLYGON (((-86.77621 33.50359, -86.77599 ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 242330 | 72 | 127 | 008300 | 3 | 721270083003 | PR | San Juan Municipio | G5030 | MULTIPOLYGON (((-66.09123 18.39897, -66.08954 ... |
| 242331 | 72 | 127 | 010012 | 3 | 721270100123 | PR | San Juan Municipio | G5030 | MULTIPOLYGON (((-66.04081 18.36705, -66.04076 ... |
| 242332 | 72 | 127 | 010022 | 2 | 721270100222 | PR | San Juan Municipio | G5030 | MULTIPOLYGON (((-66.05515 18.37903, -66.05482 ... |
| 242333 | 72 | 127 | 010100 | 1 | 721270101001 | PR | San Juan Municipio | G5030 | MULTIPOLYGON (((-66.07215 18.34087, -66.07208 ... |
| 242334 | 72 | 127 | 010100 | 3 | 721270101003 | PR | San Juan Municipio | G5030 | MULTIPOLYGON (((-66.08014 18.32918, -66.08002 ... |

242335 rows × 9 columns


fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
cbg_geo.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=0.2)
ax.axis('off')
plt.show()



[Figure: boundaries of all Census Block Groups in cbg_geo]


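This post covers only the “48 states + District of Columbia (i.e., Washington, D.C.)” on the US mainland.
One bit of TMI: the United States is said to be described at roughly three levels.

  1. United States: 50 states + Washington, D.C. (the capital, Washington, D.C., does not belong to any state)
  2. Continental United States: 49 states (all except Hawaii) + Washington, D.C. (in everyday usage the term is sometimes applied to the contiguous 48 states as well)
  3. Conterminous/Contiguous United States: 48 states (all except Hawaii and Alaska) + Washington, D.C.

In other words, only the Conterminous (Contiguous) United States is covered here.
A second bit of TMI: the 48 states excluding Hawaii and Alaska are also called the Lower 48 states.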

# Exclude three in total: Hawaii (HI), Alaska (AK), and Puerto Rico (PR)
us_cbg_geo = cbg_geo[~cbg_geo['State'].isin(['AK', 'HI', 'PR'])].reset_index(drop=True)
fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
us_cbg_geo.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=0.2)
ax.axis('off')
plt.show()


[Figure: Census Block Group boundaries for the contiguous United States]

US Open Census Data

As of June 2023, SafeGraph repackages and distributes the following demographic datasets.

  • 2016 5-year ACS : estimates pooled from ACS responses collected from 2012.01 through 2016.12
  • 2017 5-year ACS : pooled ACS estimates for 2013.01 ~ 2017.12
  • 2018 5-year ACS : pooled ACS estimates for 2014.01 ~ 2018.12
  • 2019 5-year ACS : pooled ACS estimates for 2015.01 ~ 2019.12
  • 2010-2019 Census Block Group geometries : GeoJSON used as the aggregation unit for data from 2010 through 2019
  • (NEW) 2020 5-year ACS : 2016.01 ~ 2020.12
  • (NEW) 2020-2029 Census Block Group geometries : GeoJSON to be used as the aggregation unit for data from 2020 through 2029
  • (NEW) 2020 decennial redistricting data : the 2020 Decennial Census release (official census counts; population counts only; not ACS)
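The date ranges in the list above follow a simple rule: an N-year ACS release labeled with year Y covers January of year Y - N + 1 through December of year Y. A small sketch of that arithmetic, where `acs_collection_window` is a hypothetical helper name used only for illustration:

```python
def acs_collection_window(release_year, span_years=5):
    """Return (first_month, last_month) as 'YYYY.MM' strings for an N-year ACS release."""
    if span_years < 1:
        raise ValueError('span_years must be at least 1')
    start_year = release_year - span_years + 1
    return f'{start_year}.01', f'{release_year}.12'


for year in (2016, 2017, 2018, 2019, 2020):
    print(year, acs_collection_window(year))
```

For the 2020 release this gives ('2016.01', '2020.12'), matching the list entry.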

Strictly speaking, these 5-year pooled datasets (multiyear datasets) are also produced by the US Census Bureau itself as part of the ACS. For details, see the official US Census Bureau pages below.


https://www.census.gov/programs-surveys/acs/guidance/estimates.html
: ‘When to Use 1-year or 5-year Estimates’ by US Census Bureau
https://www2.census.gov/programs-surveys/acs/tech_docs/accuracy/MultiyearACSAccuracyofData2019.pdf
: ‘Details of the multiyear (5-year) dataset’ by US Census Bureau


You may wonder why SafeGraph Inc. hosts this data on its website; they explain it as follows.

“While the US Census Bureau offers free downloads of their data, it’s often difficult and confusing to get bulk access to it at the granularity needed for advanced analysis.
(Therefore) We’ve pre-cleaned this data and packaged it into easy to use…”
- SafeGraph Inc. (https://www.safegraph.com/free-data/open-census-data).


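Roughly put, the ACS materials published by the US Census Bureau are scattered across many places and hard for users to access and use, so SafeGraph cleans them up and redistributes them in a more usable form. In this post I look at SafeGraph’s <2020 5-year ACS> data (see the Dataset Structure below).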

safegraph_open_census_data_2020
├── data
│   ├── cbg_b01.csv     # data are split into files by field (field: household income, median age, population etc...)
│   ├── cbg_b02.csv
│   ├── ...
│   ├── ...
│   └── cbg_c24.csv
└── metadata
    ├── cbg_field_descriptions.csv          # descriptions of the fields
    ├── cbg_fips_codes.csv                  # us fips code
    └── cbg_geographic_data.csv             # point-styled CBG geometry (longitude and latitude)

The table definitions for every attribute in the data can be browsed on the US Census Bureau site.

BasePath = '/open_census_data/safegraph_open_census_data_2020'
SubDir = sorted(os.listdir(BasePath))  # sort so that SubDir[0] == 'data' and SubDir[1] == 'metadata'
cbg_fd_desc = pd.read_csv(os.path.join(BasePath, SubDir[1], 'cbg_field_descriptions.csv'))
cbg_fd_desc.head()


| | table_id | table_number | table_title | table_topics | table_universe | field_level_1 | field_level_2 | field_level_3 | field_level_4 | field_level_5 | field_level_6 | field_level_7 | field_level_8 | field_level_9 | field_level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B01001e1 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | B01001e10 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 22 to 24 years | NaN | NaN | NaN | NaN |
| 2 | B01001e11 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 25 to 29 years | NaN | NaN | NaN | NaN |
| 3 | B01001e12 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 30 to 34 years | NaN | NaN | NaN | NaN |
| 4 | B01001e13 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 35 to 39 years | NaN | NaN | NaN | NaN |


# field_level_1 contains two main kinds of fields: 'Estimate' and 'MarginOfError'.
# This post looks only at the measured values themselves, so the fields of interest are limited to 'Estimate'.
cbg_fd_desc = cbg_fd_desc[cbg_fd_desc['field_level_1']=='Estimate'].reset_index(drop=True)
cbg_fd_desc


| | table_id | table_number | table_title | table_topics | table_universe | field_level_1 | field_level_2 | field_level_3 | field_level_4 | field_level_5 | field_level_6 | field_level_7 | field_level_8 | field_level_9 | field_level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B01001e1 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | B01001e10 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 22 to 24 years | NaN | NaN | NaN | NaN |
| 2 | B01001e11 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 25 to 29 years | NaN | NaN | NaN | NaN |
| 3 | B01001e12 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 30 to 34 years | NaN | NaN | NaN | NaN |
| 4 | B01001e13 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | 35 to 39 years | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4077 | C24030e55 | C24030 | Sex By Industry For The Civilian Employed Popu... | Age and Sex, Civilian Population, Industry | Civilian employed population 16 years and over | Estimate | SEX BY INDUSTRY FOR THE CIVILIAN EMPLOYED POPU... | Civilian employed population 16 years and over | Total | Female | Public administration | NaN | NaN | NaN | NaN |
| 4078 | C24030e6 | C24030 | Sex By Industry For The Civilian Employed Popu... | Age and Sex, Civilian Population, Industry | Civilian employed population 16 years and over | Estimate | SEX BY INDUSTRY FOR THE CIVILIAN EMPLOYED POPU... | Civilian employed population 16 years and over | Total | Male | Construction | NaN | NaN | NaN | NaN |
| 4079 | C24030e7 | C24030 | Sex By Industry For The Civilian Employed Popu... | Age and Sex, Civilian Population, Industry | Civilian employed population 16 years and over | Estimate | SEX BY INDUSTRY FOR THE CIVILIAN EMPLOYED POPU... | Civilian employed population 16 years and over | Total | Male | Manufacturing | NaN | NaN | NaN | NaN |
| 4080 | C24030e8 | C24030 | Sex By Industry For The Civilian Employed Popu... | Age and Sex, Civilian Population, Industry | Civilian employed population 16 years and over | Estimate | SEX BY INDUSTRY FOR THE CIVILIAN EMPLOYED POPU... | Civilian employed population 16 years and over | Total | Male | Wholesale trade | NaN | NaN | NaN | NaN |
| 4081 | C24030e9 | C24030 | Sex By Industry For The Civilian Employed Popu... | Age and Sex, Civilian Population, Industry | Civilian employed population 16 years and over | Estimate | SEX BY INDUSTRY FOR THE CIVILIAN EMPLOYED POPU... | Civilian employed population 16 years and over | Total | Male | Retail trade | NaN | NaN | NaN | NaN |

4082 rows × 15 columns
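Judging from the IDs that appear in this post, a table_id seems to be the table_number followed by `e` (estimate) or `m` (margin of error) and a running number. A minimal sketch of splitting such an ID: the regular expression and the `e`/`m` reading are assumptions inferred from examples such as B01001e26, B01002Hm3 and C24030e55, not a documented specification, and `parse_table_id` is a hypothetical helper.

```python
import re

# Pattern inferred from IDs seen in this post (e.g. B01001e26, B01002Hm3, C24030e55).
_TABLE_ID = re.compile(r'^([A-Z]\d{5}[A-Z]*)([em])(\d+)$')
_KIND = {'e': 'Estimate', 'm': 'MarginOfError'}  # assumed meaning of the suffix letter


def parse_table_id(table_id):
    """Split an ID such as 'B01001e26' into (table_number, kind, sequence)."""
    match = _TABLE_ID.match(table_id)
    if match is None:
        raise ValueError(f'unrecognized table_id: {table_id!r}')
    table_number, kind, seq = match.groups()
    return table_number, _KIND[kind], int(seq)
```

Under these assumptions, `parse_table_id('B01001e26')` returns `('B01001', 'Estimate', 26)`.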


Alternative for US and CBG Geometry

The CBG geometries GeoJSON file is too heavy to work with, so I found and used state-level shapefiles (cb_2018_us_state_500k) instead. See the URL below.
https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

# Load Point-styled CBG geometry
cbg_geo_lonlat = pd.read_csv(os.path.join(BasePath, SubDir[1], 'cbg_geographic_data.csv'))
# cbg_geo_lonlat['census_block_group'] = cbg_geo_lonlat['census_block_group'].apply(lambda x: f"{x:012d}") # zero-pad: a CBG code is 12 characters
cbg_geo_lonlat.head()


census_block_groupamount_landamount_waterlatitudelongitude
01001020100142642992843532.465832-86.489661
1100102010025561005032.485873-86.489672
2100102020012058374032.480082-86.474974
3100102020021262444566932.464435-86.469766
4100102030013866513905432.480175-86.460792
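When the CSV is read with default settings, the CBG code is parsed as an integer and loses its leading zero, which is why the commented-out line above pads it back to 12 characters. Going by the columns of the GeoJSON table earlier in the post, the 12 characters break down into state FIPS (2), county FIPS (3), tract code (6), and block group (1). A small sketch of that split, with `split_cbg_code` as a hypothetical helper name:

```python
def split_cbg_code(code):
    """Zero-pad a CBG code to 12 characters and split it into its FIPS parts."""
    text = str(code).zfill(12)
    if len(text) != 12 or not text.isdigit():
        raise ValueError(f'not a 12-digit CBG code: {code!r}')
    return {
        'state_fips': text[:2],
        'county_fips': text[2:5],
        'tract_code': text[5:11],
        'block_group': text[11],
    }
```

For the first GeoJSON row, `split_cbg_code(10330202001)` yields state '01', county '033', tract '020200', and block group '1'.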


us_states = gpd.read_file('/open_census_data/cb_2018_us_state_500k')
us_states = us_states[~us_states['STUSPS'].isin(['PR', 'AK', 'HI', 'AS', 'VI', 'GU', 'MP'])]

fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
us_states.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=.5)
ax.axis('off')
plt.show()


[Figure: state boundaries of the contiguous United States]


Gender Population for each CBG

  • table_id(male) = B01001e2
  • table_id(female) = B01001e26
cbg_fd_desc[cbg_fd_desc['table_id'].isin(['B01001e2', 'B01001e26'])]


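| | table_id | table_number | table_title | table_topics | table_universe | field_level_1 | field_level_2 | field_level_3 | field_level_4 | field_level_5 | field_level_6 | field_level_7 | field_level_8 | field_level_9 | field_level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | B01001e2 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Male | NaN | NaN | NaN | NaN | NaN |
| 18 | B01001e26 | B01001 | Sex By Age | Age and Sex | Total population | Estimate | SEX BY AGE | Total population | Total | Female | NaN | NaN | NaN | NaN | NaN |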


# Take the first three characters of table_number to find and load the matching file under /data.
BasePath = '/open_census_data/safegraph_open_census_data_2020'
print(os.path.join(BasePath, SubDir[0]))
cbg_b01 = pd.read_csv(os.path.join(BasePath, SubDir[0], 'cbg_b01.csv'))
cbg_b01.head()
/open_census_data/safegraph_open_census_data_2020/data
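The file lookup used above (first three characters of the table number, lowercased, with a `cbg_` prefix) can be written as a tiny function. This is a sketch based on the file names shown in the dataset structure (`cbg_b01.csv` ... `cbg_c24.csv`); `data_file_for` is a hypothetical helper name.

```python
def data_file_for(table_id):
    """Map a table_id or table_number (e.g. 'B25075e23') to its CSV name under /data."""
    prefix = table_id[:3].lower()
    if len(prefix) != 3 or not prefix[0].isalpha() or not prefix[1:].isdigit():
        raise ValueError(f'unexpected table_id: {table_id!r}')
    return f'cbg_{prefix}.csv'
```

With it, `data_file_for('B01001e2')` gives `'cbg_b01.csv'`, the file loaded in this section.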


census_block_groupB01001e1B01001m1B01001e2B01001m2B01001e3B01001m3B01001e4B01001m4B01001e5...B01002He3B01002Hm3B01002Ie1B01002Im1B01002Ie2B01002Im2B01002Ie3B01002Im3B01003e1B01003m1
01001020100167419228488121517235...37.527.2NaNNaNNaNNaNNaNNaN674192
11001020100212674016942444965808845...35.05.6NaNNaNNaNNaNNaNNaN1267401
2100102020017062003541425038726915...28.84.265.837.6NaNNaN65.837.6706200
31001020200210512296561753126597...42.310.0NaNNaNNaNNaNNaNNaN1051229
41001020300129125651461289222313469200...35.72.828.022.127.60.528.958.72912565

5 rows × 161 columns


cbg_b01 = pd.merge(cbg_b01, cbg_geo_lonlat, on='census_block_group')
cbg_b01 = gpd.GeoDataFrame(cbg_b01, geometry=gpd.points_from_xy(cbg_b01.longitude, cbg_b01.latitude))
cbg_b01 = cbg_b01.set_crs(epsg=4269) # The EPSG of 'cb_2018_us_state_500k' is 4269. But note that EPSG of CBG_geojson is 4326.

# gpd.sjoin(how='left/right/inner/'): ‘inner’: use intersection of keys from both dfs; retain only left_df geometry column
cbg_b01 = gpd.sjoin(cbg_b01, us_states[['NAME', 'geometry']], how='inner') # Spatial Join based on the Lower 48 states 
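Conceptually, the inner spatial join above keeps only the CBG points that fall inside one of the remaining state polygons and attaches that state's name. The pure-Python sketch below illustrates the idea with an even-odd ray-casting test on toy squares; it is not how geopandas implements `sjoin` internally, and both function names are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Even-odd ray-casting test; `ring` is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def inner_join_points(points, polygons):
    """Keep (point_name, polygon_name) pairs for points that fall inside a polygon."""
    return [
        (p_name, poly_name)
        for p_name, (x, y) in points.items()
        for poly_name, ring in polygons.items()
        if point_in_polygon(x, y, ring)
    ]


# Toy data: two unit-like squares and three points, one of them outside both.
polygons = {'A': [(0, 0), (2, 0), (2, 2), (0, 2)], 'B': [(3, 0), (5, 0), (5, 2), (3, 2)]}
points = {'p1': (1, 1), 'p2': (4, 1), 'p3': (10, 10)}
print(inner_join_points(points, polygons))  # [('p1', 'A'), ('p2', 'B')]
```

The point outside every polygon (`p3`) is dropped, just as CBGs in Alaska, Hawaii, and the territories are dropped by the inner join.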
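def alpha_with_SquareMinMaxScaling(values, min_alpha, max_alpha):
    """
    Map values to alphas by min-max scaling followed by squaring.

    Parameters
    ----------
    values : array-like of numbers
        values to convert, one alpha per value
    min_alpha : int or float
        minimum alpha for color
    max_alpha : int or float
        maximum alpha for color

    Returns
    -------
    Sequential list
        the alpha list for each value, in the same order as `values`.
    """
    values = np.asarray(values, dtype=float)
    min_val = np.min(values)
    max_val = np.max(values)
    if max_val == min_val:
        # All values are equal: avoid dividing by zero and use the minimum alpha.
        return [min_alpha] * len(values)
    # Squaring the scaled value keeps low values faint and emphasizes high ones.
    scaled = ((values - min_val) / (max_val - min_val)) ** 2
    return (min_alpha + (max_alpha - min_alpha) * scaled).tolist()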
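To see what the squaring does, it helps to compare it with plain (linear) min-max scaling on a few numbers. The sketch below is a scalar version written for illustration; `scaled_alpha` and its `power` parameter are hypothetical, with `power=2` corresponding to the function above.

```python
def scaled_alpha(v, v_min, v_max, min_alpha, max_alpha, power=2):
    """Min-max scale v into [min_alpha, max_alpha]; power=2 squares the scaled value."""
    if v_max == v_min:
        return min_alpha  # degenerate range: avoid dividing by zero
    t = (v - v_min) / (v_max - v_min)
    return min_alpha + (max_alpha - min_alpha) * t ** power


values = [0, 500, 1000]
linear = [round(scaled_alpha(v, 0, 1000, 0.01, 0.85, power=1), 3) for v in values]
squared = [round(scaled_alpha(v, 0, 1000, 0.01, 0.85, power=2), 3) for v in values]
print(linear)   # [0.01, 0.43, 0.85]
print(squared)  # [0.01, 0.22, 0.85]
```

A mid-range value drops from alpha 0.43 to 0.22 once squared, so only the most populous CBGs stand out on a map with hundreds of thousands of overlapping points.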
cbg_gender_pop = cbg_b01.loc[:, ['census_block_group', 'longitude', 'latitude', 'B01001e2', 'B01001e26']].rename(columns={'B01001e2':'male', 'B01001e26':'female'})

# Only Male
male_alps = alpha_with_SquareMinMaxScaling(cbg_gender_pop['male'].values, 0.01, 0.85)
fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
us_states.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=.5)
ax.scatter(cbg_gender_pop['longitude'], cbg_gender_pop['latitude'], s=15, c='blue', alpha=male_alps)
ax.axis('off')
plt.show()


[Figure: male population per CBG, blue points on the state map]

# Only Female
female_alps = alpha_with_SquareMinMaxScaling(cbg_gender_pop['female'].values, 0.01, 0.85)
fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
us_states.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=.5)
ax.scatter(cbg_gender_pop['longitude'], cbg_gender_pop['latitude'], s=15, c='red', alpha=female_alps)
ax.axis('off')
plt.show()


[Figure: female population per CBG, red points on the state map]

The Number of Houses by Housing Value for each CBG

Here we look at where property values are high and where they are low. I picked arbitrary dollar thresholds and grouped the field IDs that fall under each.

  • $500,000 or more table_id: B25075e23, B25075e24, B25075e25, B25075e26, B25075e27
  • Less than $50,000 table_id: B25075e2, B25075e3, B25075e4, B25075e5, B25075e6, B25075e7, B25075e8, B25075e9
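Because the IDs in each group are consecutive, the two lists above can also be generated rather than typed out. A minimal sketch, with `field_ids` as a hypothetical helper; it assumes the "table_number + e + running number" naming seen throughout this post.

```python
def field_ids(table_number, first, last):
    """Build estimate IDs table_number + 'e' + n for n in first..last (inclusive)."""
    if first > last:
        raise ValueError('first must not exceed last')
    return [f'{table_number}e{n}' for n in range(first, last + 1)]


high_value_ids = field_ids('B25075', 23, 27)  # $500,000 or more
low_value_ids = field_ids('B25075', 2, 9)     # less than $50,000
```

The running numbers still have to be checked against `cbg_field_descriptions.csv`, since the mapping from number to value bracket is specific to each table.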
cntHouse_fd_desc = cbg_fd_desc[cbg_fd_desc['table_title']=='Value'].reset_index(drop=True)
cntHouse_fd_desc


| | table_id | table_number | table_title | table_topics | table_universe | field_level_1 | field_level_2 | field_level_3 | field_level_4 | field_level_5 | field_level_6 | field_level_7 | field_level_8 | field_level_9 | field_level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B25075e1 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | B25075e10 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $50 000 to $59 999 | NaN | NaN | NaN | NaN | NaN |
| 2 | B25075e11 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $60 000 to $69 999 | NaN | NaN | NaN | NaN | NaN |
| 3 | B25075e12 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $70 000 to $79 999 | NaN | NaN | NaN | NaN | NaN |
| 4 | B25075e13 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $80 000 to $89 999 | NaN | NaN | NaN | NaN | NaN |
| 5 | B25075e14 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $90 000 to $99 999 | NaN | NaN | NaN | NaN | NaN |
| 6 | B25075e15 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $100 000 to $124 999 | NaN | NaN | NaN | NaN | NaN |
| 7 | B25075e16 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $125 000 to $149 999 | NaN | NaN | NaN | NaN | NaN |
| 8 | B25075e17 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $150 000 to $174 999 | NaN | NaN | NaN | NaN | NaN |
| 9 | B25075e18 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $175 000 to $199 999 | NaN | NaN | NaN | NaN | NaN |
| 10 | B25075e19 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $200 000 to $249 999 | NaN | NaN | NaN | NaN | NaN |
| 11 | B25075e2 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | Less than $10 000 | NaN | NaN | NaN | NaN | NaN |
| 12 | B25075e20 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $250 000 to $299 999 | NaN | NaN | NaN | NaN | NaN |
| 13 | B25075e21 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $300 000 to $399 999 | NaN | NaN | NaN | NaN | NaN |
| 14 | B25075e22 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $400 000 to $499 999 | NaN | NaN | NaN | NaN | NaN |
| 15 | B25075e23 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $500 000 to $749 999 | NaN | NaN | NaN | NaN | NaN |
| 16 | B25075e24 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $750 000 to $999 999 | NaN | NaN | NaN | NaN | NaN |
| 17 | B25075e25 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $1 000 000 to $1 499 999 | NaN | NaN | NaN | NaN | NaN |
| 18 | B25075e26 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $1 500 000 to $1 999 999 | NaN | NaN | NaN | NaN | NaN |
| 19 | B25075e27 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $2 000 000 or more | NaN | NaN | NaN | NaN | NaN |
| 20 | B25075e3 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $10 000 to $14 999 | NaN | NaN | NaN | NaN | NaN |
| 21 | B25075e4 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $15 000 to $19 999 | NaN | NaN | NaN | NaN | NaN |
| 22 | B25075e5 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $20 000 to $24 999 | NaN | NaN | NaN | NaN | NaN |
| 23 | B25075e6 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $25 000 to $29 999 | NaN | NaN | NaN | NaN | NaN |
| 24 | B25075e7 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $30 000 to $34 999 | NaN | NaN | NaN | NaN | NaN |
| 25 | B25075e8 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $35 000 to $39 999 | NaN | NaN | NaN | NaN | NaN |
| 26 | B25075e9 | B25075 | Value | Housing Value and Purchase Price, Owner Renter... | Owner-occupied housing units | Estimate | VALUE | Owner-occupied housing units | Total | $40 000 to $49 999 | NaN | NaN | NaN | NaN | NaN |


# Take the first three characters of table_number to find and load the matching file under /data.
BasePath = '/open_census_data/safegraph_open_census_data_2020'
print(os.path.join(BasePath, SubDir[0]))
cbg_b25 = pd.read_csv(os.path.join(BasePath, SubDir[0], 'cbg_b25.csv'))
cbg_b25.head()
/open_census_data/safegraph_open_census_data_2020/data


census_block_groupB25001e1B25001m1B25002e1B25002m1B25002e2B25002m2B25002e3B25002m3B25003e1...B25093e25B25093m25B25093e26B25093m26B25093e27B25093m27B25093e28B25093m28B25093e29B25093m29
010010201001290772907729077012290...78183115231221012
1100102010024201164201164031131727403...0124601236012
2100102020012845728457227535749227...351117012012012
3100102020024368343683346869052346...171701201224012
410010203001114718511471851034185113891034...01247582226012

5 rows × 1741 columns


cbg_b25 = pd.merge(cbg_b25, cbg_geo_lonlat, on='census_block_group')
cbg_b25 = gpd.GeoDataFrame(cbg_b25, geometry=gpd.points_from_xy(cbg_b25.longitude, cbg_b25.latitude))
cbg_b25 = cbg_b25.set_crs(epsg=4269)
cbg_b25 = gpd.sjoin(cbg_b25, us_states[['NAME', 'geometry']], how='inner').reset_index(drop=True) # Spatial Join based on the Lower 48 states 
cbg_b25.head()


census_block_groupB25001e1B25001m1B25002e1B25002m1B25002e2B25002m2B25002e3B25002m3B25003e1...B25093m28B25093e29B25093m29amount_landamount_waterlatitudelongitudegeometryindex_rightNAME
010010201001290772907729077012290...2101242642992843532.465832-86.489661POINT (-86.48966 32.46583)17Alabama
1100102010024201164201164031131727403...60125561005032.485873-86.489672POINT (-86.48967 32.48587)17Alabama
2100102020012845728457227535749227...120122058374032.480082-86.474974POINT (-86.47497 32.48008)17Alabama
3100102020024368343683346869052346...40121262444566932.464435-86.469766POINT (-86.46977 32.46444)17Alabama
410010203001114718511471851034185113891034...260123866513905432.480175-86.460792POINT (-86.46079 32.48018)17Alabama

5 rows × 1748 columns


HighMortValue = 'B25075e23, B25075e24, B25075e25, B25075e26, B25075e27'.split(', ')
LowMortValue = 'B25075e2, B25075e3, B25075e4, B25075e5, B25075e6, B25075e7, B25075e8, B25075e9'.split(', ') 
cbg_cntHouse_value = cbg_b25.loc[:, ['census_block_group', 'latitude', 'longitude']]
cbg_cntHouse_value['HighHouseCnt'] = cbg_b25.loc[:, HighMortValue].sum(axis=1)
cbg_cntHouse_value['LowHouseCnt'] = cbg_b25.loc[:, LowMortValue].sum(axis=1)
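The sums above add up several ACS estimates, each of which comes with its own margin of error (the `m` columns that this post sets aside). As far as I know, the usual Census Bureau approximation for the margin of error of a sum is the square root of the sum of the squared margins. A small sketch with made-up numbers; the helper names are hypothetical and the figures are not taken from the dataset.

```python
import math


def moe_of_sum(moes):
    """Approximate margin of error of a sum of estimates (root sum of squares)."""
    return math.sqrt(sum(m * m for m in moes))


def interval(estimate, moe):
    """Return the (low, high) bounds implied by estimate +/- moe."""
    return estimate - moe, estimate + moe


# Made-up numbers for illustration only, not taken from the dataset.
estimates = [30, 40]
moes = [3.0, 4.0]
total = sum(estimates)
total_moe = moe_of_sum(moes)
print(total, total_moe, interval(total, total_moe))  # 70 5.0 (65.0, 75.0)
```

For small CBGs the margins can be as large as the estimates themselves, so such intervals are worth a look before reading too much into a single dot on the map.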
# The number of House with high value (more than $500,000)
high_alps = alpha_with_SquareMinMaxScaling(cbg_cntHouse_value['HighHouseCnt'].values, 0, 0.85)
fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
us_states.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=.5)
ax.scatter(cbg_cntHouse_value['longitude'], cbg_cntHouse_value['latitude'], s=15, c='blue', alpha=high_alps)
ax.axis('off')
plt.show()



[Figure: number of houses valued at $500,000 or more per CBG, blue points on the state map]

# The number of House with low value (lower than $50,000)
low_alps = alpha_with_SquareMinMaxScaling(cbg_cntHouse_value['LowHouseCnt'].values, 0, 0.85)
fig, ax = plt.subplots(facecolor='w', figsize=(15, 15))
us_states.plot(ax=ax, facecolor='None', edgecolor='black', linewidth=.5)
ax.scatter(cbg_cntHouse_value['longitude'], cbg_cntHouse_value['latitude'], s=15, c='red', alpha=low_alps)
ax.axis('off')
plt.show()


[Figure: number of houses valued under $50,000 per CBG, red points on the state map]

Educational Attainment for only US citizens 18 years and over

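Here we look at where in the US the populations with high and low educational attainment are concentrated.

  • Less than high school through high school graduate table_id: B29002e2, B29002e3, B29002e4
  • Associate’s degree, Bachelor’s degree, and graduate or professional degree table_id: B29002e6, B29002e7, B29002e8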
# Take the first three characters of table_number to find and load the matching file under /data.
BasePath = '/open_census_data/safegraph_open_census_data_2020'
print(os.path.join(BasePath, SubDir[0]))
cbg_b29 = pd.read_csv(os.path.join(BasePath, SubDir[0], 'cbg_b29.csv'))
cbg_b29.head()
/open_census_data/safegraph_open_census_data_2020/data


census_block_groupB29001e1B29001m1B29001e2B29001m2B29001e3B29001m3B29001e4B29001m4B29001e5...B29002e8B29002m8B29003e1B29003m1B29003e2B29003m2B29003e3B29003m3B29004e1B29004m1
010010201001574161102811207922677126...5535574161726350216439167.020140.0
11001020100294825616380298130322134165...76349482561007084824570699.011633.0
21001020200145812189521436710340123...794581211066635210839750.020003.0
3100102020029742112891432238330173161...3727762201393772320450221.03210.0
4100102030012045413317131730234680221318...17298204541317093187541266843.010424.0

5 rows × 35 columns


target_table = 'B29002e2, B29002e3, B29002e4, B29002e6, B29002e7, B29002e8'.split(', ')
target_table
['B29002e2', 'B29002e3', 'B29002e4', 'B29002e6', 'B29002e7', 'B29002e8']
cbg_fd_desc[cbg_fd_desc['table_id'].isin(target_table)].reset_index(drop=True)


| | table_id | table_number | table_title | table_topics | table_universe | field_level_1 | field_level_2 | field_level_3 | field_level_4 | field_level_5 | field_level_6 | field_level_7 | field_level_8 | field_level_9 | field_level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B29002e2 | B29002 | Citizen, Voting-Age Population By Educational ... | Age and Sex, Citizenship, Educational Attainment | Citizens 18 years and over | Estimate | CITIZEN, VOTING-AGE POPULATION BY EDUCATIONAL ... | Citizens 18 years and over | Total | Less than 9th grade | NaN | NaN | NaN | NaN | NaN |
| 1 | B29002e3 | B29002 | Citizen, Voting-Age Population By Educational ... | Age and Sex, Citizenship, Educational Attainment | Citizens 18 years and over | Estimate | CITIZEN, VOTING-AGE POPULATION BY EDUCATIONAL ... | Citizens 18 years and over | Total | 9th to 12th grade no diploma | NaN | NaN | NaN | NaN | NaN |
| 2 | B29002e4 | B29002 | Citizen, Voting-Age Population By Educational ... | Age and Sex, Citizenship, Educational Attainment | Citizens 18 years and over | Estimate | CITIZEN, VOTING-AGE POPULATION BY EDUCATIONAL ... | Citizens 18 years and over | Total | High school graduate (includes equivalency) | NaN | NaN | NaN | NaN | NaN |
| 3 | B29002e6 | B29002 | Citizen, Voting-Age Population By Educational ... | Age and Sex, Citizenship, Educational Attainment | Citizens 18 years and over | Estimate | CITIZEN, VOTING-AGE POPULATION BY EDUCATIONAL ... | Citizens 18 years and over | Total | Associate's degree | NaN | NaN | NaN | NaN | NaN |
| 4 | B29002e7 | B29002 | Citizen, Voting-Age Population By Educational ... | Age and Sex, Citizenship, Educational Attainment | Citizens 18 years and over | Estimate | CITIZEN, VOTING-AGE POPULATION BY EDUCATIONAL ... | Citizens 18 years and over | Total | Bachelor's degree | NaN | NaN | NaN | NaN | NaN |
| 5 | B29002e8 | B29002 | Citizen, Voting-Age Population By Educational ... | Age and Sex, Citizenship, Educational Attainment | Citizens 18 years and over | Estimate | CITIZEN, VOTING-AGE POPULATION BY EDUCATIONAL ... | Citizens 18 years and over | Total | Graduate or professional degree | NaN | NaN | NaN | NaN | NaN |


cbg_b29 = pd.merge(cbg_b29, cbg_geo_lonlat, on='census_block_group')
cbg_b29 = gpd.GeoDataFrame(cbg_b29, geometry=gpd.points_from_xy(cbg_b29.longitude, cbg_b29.latitude))
cbg_b29 = cbg_b29.set_crs(epsg=4269)
cbg_b29 = gpd.sjoin(cbg_b29, us_states[['NAME', 'geometry']], how='inner').reset_index(drop=True) # Spatial Join based on the Lower 48 states 
HighEdu = 'B29002e6, B29002e7, B29002e8'.split(', ')
LowEdu = 'B29002e2, B29002e3, B29002e4'.split(', ')

cbg_cntPop_edu = cbg_b29.loc[:, ['census_block_group', 'latitude', 'longitude']]
cbg_cntPop_edu['HighEduPop'] = cbg_b29.loc[:, HighEdu].sum(axis=1)
cbg_cntPop_edu['LowEduPop'] = cbg_b29.loc[:, LowEdu].sum(axis=1)
# Population with high vs. low educational attainment (citizens 18 years and over)
high_alps = alpha_with_SquareMinMaxScaling(cbg_cntPop_edu['HighEduPop'].values, 0, 0.85)
low_alps = alpha_with_SquareMinMaxScaling(cbg_cntPop_edu['LowEduPop'].values, 0, 0.85)

fig, axs = plt.subplots(nrows=1, ncols=2, facecolor='w', figsize=(15, 15))
us_states.plot(ax=axs[0], facecolor='None', edgecolor='black', linewidth=.5)
us_states.plot(ax=axs[1], facecolor='None', edgecolor='black', linewidth=.5)

axs[0].scatter(cbg_cntPop_edu['longitude'], cbg_cntPop_edu['latitude'], s=15, c='blue', alpha=high_alps)
axs[1].scatter(cbg_cntPop_edu['longitude'], cbg_cntPop_edu['latitude'], s=15, c='red', alpha=low_alps)

axs[0].axis('off')
axs[1].axis('off')
fig.subplots_adjust(wspace=.1)
plt.show()


[Figure: population with high (left, blue) and low (right, red) educational attainment per CBG]

Take-Home Message and Discussion

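  • We looked at SafeGraph Inc.'s Open Census Data.
  • It uses the Census Block Group (CBG) as its spatial scale.
  • It is based on the American Community Survey (ACS), which the US Census Bureau conducts every year.
  • SafeGraph’s Open Census Data was created to make ACS data easier for users to work with.
  • It contains a variety of information beyond population counts, from which regional income level, educational attainment, and more can be estimated.
  • SafeGraph Inc. also distributes card-spending data (‘Spend’) and worldwide store information (‘Places’). Unfortunately, access to those datasets is a paid subscription service, so they are not covered in this post…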

fin

This post is licensed under CC BY 4.0 by the author.