Data

twgeo.data.input

twgeo.data.input.read_csv_data(csv_filename: str, location_column_idx: int, tweet_txt_column_idx: int)

Preprocess raw tweet data from a CSV file. For each row, this function will:

  1. Tokenize the tweet text.
  2. Limit repeated characters to a maximum of 2. For example: ‘Greeeeeetings’ becomes ‘Greetings’.
  3. Perform Porter stemming on each token.
  4. Convert each token to lower case.
Parameters:
  • csv_filename – Path to the CSV file to read.
  • location_column_idx – The zero-based index of the CSV column that contains the location information. Each value must be discrete (a string or an integer).
  • tweet_txt_column_idx – The zero-based index of the CSV column that contains the tweet text.
Returns:

Tuple (preprocessed_tweets, locations)
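
The per-row steps above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `preprocess` and `read_rows` are hypothetical names, whitespace splitting stands in for whatever tokenizer the library uses, and Porter stemming (step 3) is left out because it needs an external stemmer.

```python
import csv
import io
import re

# Collapse any run of 3 or more identical characters down to exactly 2.
_REPEATS = re.compile(r"(.)\1{2,}")


def preprocess(text):
    """Hypothetical helper approximating steps 1, 2 and 4 (no stemming)."""
    tokens = text.split()  # step 1: naive whitespace tokenization (assumption)
    tokens = [_REPEATS.sub(r"\1\1", t) for t in tokens]  # step 2: limit repeats
    # Step 3 (Porter stemming) is omitted in this sketch.
    return [t.lower() for t in tokens]  # step 4: lower-case each token


def read_rows(csv_file, location_column_idx, tweet_txt_column_idx):
    """Hypothetical stand-in mirroring the (preprocessed_tweets, locations) shape."""
    tweets, locations = [], []
    for row in csv.reader(csv_file):
        tweets.append(preprocess(row[tweet_txt_column_idx]))
        locations.append(row[location_column_idx])
    return tweets, locations


# Column 0 holds the location label, column 1 the tweet text (zero-based).
sample = io.StringIO('FL,"Greeeeeetings from Miami"\nNY,"Sooooo cold today"\n')
tweets, locations = read_rows(sample, location_column_idx=0, tweet_txt_column_idx=1)
print(tweets)
print(locations)
```

Both column indices are zero-based, so a file whose first column is the label and whose second is the text uses 0 and 1 respectively.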

twgeo.data.reverse_geocode

class twgeo.data.reverse_geocode.ReverseGeocode

Bases: object

Convenience class for finding the US state and Census region for a pair of geographic coordinates. This class assumes you have a PostgreSQL database with the PostGIS extension installed locally. To set up PostGIS with US Census data, follow the instructions found here.

get_state_index(state_abbrev) → int

Get the integer value of the given state.

Parameters: state_abbrev – The state abbreviation, for example ‘FL’ or ‘NY’.
Returns: The integer value of the state.

get_state_region(state_abbrev) → int

Get the integer value of the Census region of the given state.

Parameters: state_abbrev – The state abbreviation, for example ‘FL’ or ‘NY’.
Returns: The integer value of the state's Census region.

get_state_region_name(state_abbrev) → str

Get the name of the Census region of a given state.

Parameters: state_abbrev – The state abbreviation, for example ‘FL’ or ‘NY’.
Returns: The name of the state's Census region.

reverse_geocode_state(location) → str

Find the corresponding US state of a given pair of coordinates.

Parameters: location – A tuple containing the (latitude, longitude).
Returns: The corresponding state abbreviation, for example ‘FL’ or ‘NY’.
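
To illustrate the relationship between a state abbreviation, its Census region index, and the region name, here is a minimal lookup sketch. The tables and function names are hypothetical, the integer encoding is an assumption that may differ from the one the class uses, and only four states are listed. No database is involved.

```python
# Hypothetical lookup tables; the library's actual integer encoding may differ.
REGION_NAMES = ["Northeast", "Midwest", "South", "West"]  # the four US Census regions
STATE_TO_REGION = {"NY": 0, "IL": 1, "FL": 2, "CA": 3}  # small subset for illustration


def region_index(state_abbrev):
    """Return the (assumed) integer region index for a state abbreviation."""
    return STATE_TO_REGION[state_abbrev.upper()]


def region_name(state_abbrev):
    """Return the Census region name for a state abbreviation."""
    return REGION_NAMES[region_index(state_abbrev)]


print(region_name("FL"))
```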

twgeo.data.twus_dataset

Built-in dataset of ~450K US-based users.

twgeo.data.twus_dataset.load_region_data(size='large')

Training samples labeled with the corresponding US Census Region.

Parameters: size – One of ‘micro’ (10,000 samples), ‘small’ (50,000 samples), ‘mid’ (100,000 samples), or ‘large’ (410,000 samples).
Returns: Tuple (x_train, y_train, x_dev, y_dev, x_test, y_test)

twgeo.data.twus_dataset.load_state_data(size='large')

Training samples labeled with the corresponding US State.

Parameters: size – One of ‘micro’ (10,000 samples), ‘small’ (50,000 samples), ‘mid’ (100,000 samples), or ‘large’ (410,000 samples).
Returns: Tuple (x_train, y_train, x_dev, y_dev, x_test, y_test)
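
Both loaders return the same six-element tuple, so it can help to unpack it once and check the splits. The sketch below uses a hypothetical helper, `summarize_splits`, and made-up stand-in data so that it runs without the dataset. With the library installed, the argument would presumably be the result of a call such as `load_state_data(size='micro')`.

```python
def summarize_splits(splits):
    """Hypothetical helper: unpack the six-element tuple and count samples per split."""
    x_train, y_train, x_dev, y_dev, x_test, y_test = splits
    pairs = (("train", x_train, y_train), ("dev", x_dev, y_dev), ("test", x_test, y_test))
    for name, x, y in pairs:
        # Each split should have exactly one label per sample.
        if len(x) != len(y):
            raise ValueError(f"{name}: {len(x)} samples but {len(y)} labels")
    return {name: len(x) for name, x, _ in pairs}


# Made-up stand-in data with the same tuple layout as the loaders' return value.
fake_splits = (["t1", "t2", "t3"], ["FL", "NY", "FL"], ["t4"], ["NY"], ["t5"], ["FL"])
counts = summarize_splits(fake_splits)
print(counts)
```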
