US city + state geocoding, without a heavy webservice. With Who's On First and SQLite.
Sometimes you've got "location" fields that contain a weird mix of cities and states. Stuff like:
SF
Los Angeles, CA
Boston
California
bellingham washington
NYC
tuscaloosa AL
big apple
To make use of these, they generally need to be linked against some kind of canonical set of geographic entities. One approach is to throw them at a commercial geocoder like Google or Mapbox, but this is slow and expensive, and there are often onerous terms-of-service restrictions on the results. And, really, a full-blown geocoder is overkill here, since these kinds of location fields almost never contain street addresses, just references to a smaller set of high-level locations.
Litecoder is a small library that links these kinds of free-text location strings to records in the Who's On First (WOF) gazetteer from Mapzen, which includes both high-quality geographic metadata as well as IDs for corresponding records in a number of other gazetteers and knowledge databases (Wikipedia, Wikidata, DBpedia, Geonames, etc). Mapzen sadly doesn't exist anymore, but the WOF data is CC-0.
For now, Litecoder only supports US cities and states.
- Be fast. Lookups take ~20µs.
- Work anywhere without hassle. The underlying data ships with the package and is small enough to fit in memory (~100m). Since everything sits in RAM, the library can be used in ETL and big data workflows involving billions of inputs.
- Comprehensive support for nicknames and abbreviations. Eg,
Windy City
always means Chicago. - Some heuristics are unavoidable - eg,
Boston
should map toBoston, MA
, notBoston, GA
(which exists!). In these cases, do something simple and easy to reason about.
- Match locations embedded inside of surrounding text. For now, the assumption is that you've got a snippet of text that represents a location, and the goal is to figure out which one.
- Locations more granular than cities / towns - major parks, venues, etc.
- International cities + countries.
If you would like to install the development version, please run the following commands:
sudo apt-get install graphviz libgraphviz-dev pkg-config
git clone https://github.com/social-machines/litecoder.git
cd litecoder
pipenv install
Otherwise, you can download litecoder
using pip
pip install litecoder
from litecoder.usa import USCityIndex
# Load the pre-built index.
idx = USCityIndex()
idx.load()
# Basic city, state, country.
idx['Boston, Massachusetts']
idx['Boston, MA']
idx['Boston, MA, USA']
>> [CityMatch<Boston, Massachusetts, United States, wof:85950361>]
# Normalize differences in capitalization, spacing, commas.
idx['boston, ma']
idx['boston ma']
idx[' boston ma ']
idx['BOSTON MA']
>> [CityMatch<Boston, Massachusetts, United States, wof:85950361>]
# For major cities, match the "bare" city name.
idx['Boston']
>> [CityMatch<Boston, Massachusetts, United States, wof:85950361>]
# Since "Boston" alone (almost) never refers to Boston, GA!
idx['Boston, GA']
>> [CityMatch<Boston, Georgia, United States, wof:85936819>]
# But don't guess when there isn't a clear "major" city...
idx['Springfield']
>> []
# ... Until more detail is provided.
idx['Springfield, IL']
>> [CityMatch<Springfield, Illinois, United States, wof:85940429>]
# Match major abbreviations, alternate names, nicknames.
idx['NYC']
idx['New York City']
idx['Big Apple']
idx['Nueva York']
>> [CityMatch<New York, New York, United States, wof:85977539>]
from litecoder.usa import USStateIndex
# Load the pre-built index.
idx = USStateIndex()
idx.load()
# Basic state, country.
idx['Massachusetts']
idx['Massachusetts, USA']
>> [StateMatch<Massachusetts, United States, wof:85688645>]
Litecoder supports a memory mapped index load:
from litecoder.usa import USCityIndex
idx = USCityIndex()
idx.load(mmap=True)
When enabled, the index will share memory with any other index in any
other process on the machine also loaded with mmap=True
. This is
useful in parallel processing contexts where you may want a litecoder
instances per process, but the sum total of the memory usage would
exceed the machine resources. In addition to the memory efficiency,
memory mapping a file is faster than mmap=False
, which is the
default and copies the entire database into memory. The trade-off is
that initial accesses to the database with mmap=True
may be slower
as the operating system lazily pages data in as needed.
The city and state indexes return "match" objects that hold all metadata associated with the location, as well as denormalized copies of parent entities.
idx = USCityIndex()
idx.load()
sf = idx['San Francisco'][0]
sf['name']
>> 'San Francisco'
sf['population']
>> 805235
sf['latitude']
>> 37.759715
sf['longitude']
>> -122.693976
sf['region'].['name_abbr']
>> 'CA'
sf
>>
{'area_m2': 600307527.980684,
'country_iso': 'US',
'dbp_id': 'San_Francisco',
'duplicate': False,
'elevation': 16,
'fb_id': 'en.san_francisco',
'fct_id': '08cb9cb0-8f76-11e1-848f-cfd5bf3ef515',
'fips_code': '667000',
'gn_id': 5391959,
'gp_id': 2487956,
'latitude': 37.759715,
'loc_id': 'n79018452',
'longitude': -122.693976,
'name': 'San Francisco',
'name_a0': 'United States',
'name_a1': 'California',
'nyt_id': '9223372036854775807',
'population': 805235,
'qs_id': 240388,
'qs_pg_id': 240388,
'region': {'area_m2': 423822167986.13293,
'country_iso': 'US',
'fips_code': 'US06',
'gn_id': 5332921,
'gp_id': 2347563,
'hasc_id': 'US.CA',
'iso_id': 'US-CA',
'latitude': 37.215297,
'longitude': -119.663837,
'name': 'California',
'name_a0': 'United States',
'name_abbr': 'CA',
'population': 37253956,
'unlc_id': 'US-CA',
'wd_id': 'Q99',
'wof_continent_id': 102191575,
'wof_country_id': 85633793,
'wof_id': 85688637},
'wd_id': 'Q62',
'wikipedia_wordcount': None,
'wk_page': 'San Francisco',
'wof_continent_id': 102191575,
'wof_country_id': 85633793,
'wof_id': 85922583,
'wof_region_id': 85688637}
idx = USStateIndex()
idx.load()
ca = idx['California'][0]
ca['name']
>> 'California'
ca['population']
>> 37253956
ca['area_m2']
>> 423822167986.13293
ca
>>
{'area_m2': 423822167986.13293,
'country_iso': 'US',
'fips_code': 'US06',
'gn_id': 5332921,
'gp_id': 2347563,
'hasc_id': 'US.CA',
'iso_id': 'US-CA',
'latitude': 37.215297,
'longitude': -119.663837,
'name': 'California',
'name_a0': 'United States',
'name_abbr': 'CA',
'population': 37253956,
'unlc_id': 'US-CA',
'wd_id': 'Q99',
'wof_continent_id': 102191575,
'wof_country_id': 85633793,
'wof_id': 85688637}