Data
twgeo.data.input
twgeo.data.input.read_csv_data(csv_filename: str, location_column_idx: int, tweet_txt_column_idx: int)
Pre-process raw tweet data from a CSV file. For each row, this function will:
- Tokenize the tweet text.
- Limit repeated characters to a maximum of 2. For example: ‘Greeeeeetings’ becomes ‘Greetings’.
- Perform Porter stemming on each token.
- Convert each token to lower case.
Parameters:
- csv_filename – Path to the CSV file containing the raw tweet data.
- location_column_idx – The zero-based index of the CSV column that contains the location information. The location must be a discrete value (a string or an integer).
- tweet_txt_column_idx – The zero-based index of the CSV column that contains the tweet text.
Returns: Tuple (preprocessed_tweets, locations)
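Some of the steps listed above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: `limit_repeats` and `sketch_read_rows` are hypothetical names, the tokenizer is a naive whitespace split, and Porter stemming is left out because it needs a third-party stemmer.

```python
import csv
import io
import re

# Matches any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")


def limit_repeats(token: str) -> str:
    """Collapse runs of 3+ identical characters down to exactly 2."""
    return _REPEATS.sub(r"\1\1", token)


def sketch_read_rows(csv_file, location_column_idx: int, tweet_txt_column_idx: int):
    """Hypothetical stand-in for the documented steps (stemming omitted)."""
    tweets, locations = [], []
    for row in csv.reader(csv_file):
        # Naive whitespace tokenizer; the real tokenizer may differ.
        tokens = row[tweet_txt_column_idx].split()
        tokens = [limit_repeats(t).lower() for t in tokens]
        tweets.append(tokens)
        locations.append(row[location_column_idx])
    return tweets, locations


# Column 0 holds the location label, column 1 holds the tweet text.
sample = io.StringIO("FL,Greeeeeetings from Miami\nNY,Sooooo cold today\n")
tweets, locations = sketch_read_rows(sample, location_column_idx=0, tweet_txt_column_idx=1)
```

Both column indices are zero-based, matching the parameter descriptions, and the result has the same two-part shape as the documented return value: a list of token lists and a parallel list of locations.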
twgeo.data.reverse_geocode
class twgeo.data.reverse_geocode.ReverseGeocode
Bases: object
Convenience class to find the US state and Census region for a pair of geographic coordinates. This class assumes you have a PostgreSQL database with the PostGIS extension installed locally. To set up PostGIS with US Census data, follow the instructions found here.
get_state_index(state_abbrev) → int
Get the integer value of the given state.
Parameters: state_abbrev – A US state abbreviation. Example: ‘FL’ or ‘NY’.
Returns: The integer value of the given state.
get_state_region(state_abbrev) → int
Get the integer value of the Census region of the given state.
Parameters: state_abbrev – A US state abbreviation. Example: ‘FL’ or ‘NY’.
Returns: The integer value of the state's Census region.
get_state_region_name(state_abbrev) → str
Get the name of the Census region of the given state.
Parameters: state_abbrev – A US state abbreviation. Example: ‘FL’ or ‘NY’.
Returns: The name of the state's Census region.
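The state-to-region lookups above can be pictured with a small hand-written table. This is only a sketch under stated assumptions: the table covers four states, the integer codes are invented for illustration and need not match what the library returns, and the `sketch_` functions are hypothetical. Only the four Census region names are standard.

```python
# The four US Census regions; list position doubles as a made-up integer code.
CENSUS_REGIONS = ["Northeast", "Midwest", "South", "West"]

# Hypothetical, deliberately incomplete table: one example state per region.
STATE_TO_REGION = {"NY": "Northeast", "IL": "Midwest", "FL": "South", "CA": "West"}


def sketch_state_region_name(state_abbrev: str) -> str:
    """Return the Census region name for a state abbreviation."""
    return STATE_TO_REGION[state_abbrev]


def sketch_state_region(state_abbrev: str) -> int:
    """Return an illustrative integer code for the state's Census region."""
    return CENSUS_REGIONS.index(STATE_TO_REGION[state_abbrev])
```

Each call takes a single abbreviation such as ‘FL’, not a comma-separated list.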
reverse_geocode_state(location) → str
Find the US state that contains a given pair of coordinates.
Parameters: location – A tuple containing the (latitude, longitude).
Returns: The corresponding state abbreviation. Example: ‘FL’ or ‘NY’.
twgeo.data.twus_dataset
Built-in dataset of ~450K US-based users.
twgeo.data.twus_dataset.load_region_data(size='large')
Training samples labeled with the corresponding US Census Region.
Parameters: size – One of ‘micro’ (10,000 samples), ‘small’ (50,000 samples), ‘mid’ (100,000 samples), or ‘large’ (410,000 samples).
Returns: Tuple (x_train, y_train, x_dev, y_dev, x_test, y_test)
twgeo.data.twus_dataset.load_state_data(size='large')
Training samples labeled with the corresponding US State.
Parameters: size – One of ‘micro’ (10,000 samples), ‘small’ (50,000 samples), ‘mid’ (100,000 samples), or ‘large’ (410,000 samples).
Returns: Tuple (x_train, y_train, x_dev, y_dev, x_test, y_test)
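When choosing a value for `size`, it can help to keep the documented sample counts in one place. The sketch below is a hypothetical helper, not part of twgeo: `DATASET_SIZES` copies the counts from the parameter description above, and `pick_size` picks the largest documented size that fits a sample budget. The commented lines show how the documented loader would then be called, assuming twgeo is installed.

```python
# Documented sizes and their sample counts, copied from the parameter description.
DATASET_SIZES = {"micro": 10_000, "small": 50_000, "mid": 100_000, "large": 410_000}


def pick_size(max_samples: int) -> str:
    """Return the largest documented size with at most max_samples samples."""
    fitting = [name for name, count in DATASET_SIZES.items() if count <= max_samples]
    if not fitting:
        raise ValueError("no documented size is that small")
    return max(fitting, key=DATASET_SIZES.get)


size = pick_size(60_000)

# With twgeo installed, the documented call would then look like this:
# from twgeo.data import twus_dataset
# x_train, y_train, x_dev, y_dev, x_test, y_test = twus_dataset.load_state_data(size=size)
```

The same `size` values apply to `load_region_data`, which returns a tuple of the same shape with Census region labels instead of state labels.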