Data

twgeo.data.input

twgeo.data.input.read_csv_data(csv_filename: str, location_column_idx: int, tweet_txt_column_idx: int)

Pre-process raw tweet data from a csv file. For each row, this function will:

Tokenize the tweet text.

Limit repeated characters to a maximum of 2. For example: ‘Greeeeeetings’ becomes ‘Greetings’.

Perform Porter stemming on each token.

Convert each token to lower case.

Parameters:	csv_filename – Path to the CSV file to read.
	location_column_idx – The zero-based index of the CSV column that contains the location information. Each value must be discrete (a string or an integer).
	tweet_txt_column_idx – The zero-based index of the CSV column that contains the tweet text.
Returns:	Tuple (preprocessed_tweets, locations)
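The per-row steps listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the twgeo implementation: `squeeze_repeats`, `preprocess`, and `read_rows` are hypothetical names, tokenization is a plain whitespace split, the CSV is assumed to have no header row, and Porter stemming is left out because it needs a third-party stemmer.

```python
import csv
import io
import re


def squeeze_repeats(token):
    """Limit any run of the same character to at most two occurrences."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def preprocess(text):
    """Whitespace-tokenize, squeeze repeated characters, and lower-case.

    Porter stemming is deliberately omitted from this sketch.
    """
    return [squeeze_repeats(token).lower() for token in text.split()]


def read_rows(csv_file, location_column_idx, tweet_txt_column_idx):
    """Hypothetical stand-in that returns (preprocessed_tweets, locations)."""
    tweets, locations = [], []
    for row in csv.reader(csv_file):
        tweets.append(preprocess(row[tweet_txt_column_idx]))
        locations.append(row[location_column_idx])
    return tweets, locations


# Two made-up rows: column 0 holds the location, column 1 the tweet text.
sample = io.StringIO("FL,Greeeeeetings from Miami\nNY,Sooooo cold today\n")
tweets, locations = read_rows(sample, location_column_idx=0, tweet_txt_column_idx=1)
print(tweets)
print(locations)
```

The two returned lists are parallel: `tweets[i]` is the token list for the row whose label is `locations[i]`, which mirrors the (preprocessed_tweets, locations) tuple described above.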

twgeo.data.reverse_geocode

class twgeo.data.reverse_geocode.ReverseGeocode

Bases: object

Convenience class for finding the US state and Census region that correspond to a pair of geographic coordinates. This class assumes you have a PostgreSQL database with the PostGIS extension installed locally. To set up PostGIS with US Census data, follow the instructions found here.

get_state_index(state_abbrev) → int

Get the integer value of the given state.

Parameters:	state_abbrev – A two-letter state abbreviation. Example: ‘FL’ or ‘NY’.
Returns:	The integer value of the state.

get_state_region(state_abbrev) → int

Get the integer value of the Census region of the given state.

Parameters:	state_abbrev – A two-letter state abbreviation. Example: ‘FL’ or ‘NY’.
Returns:	The integer value of the state's Census region.

get_state_region_name(state_abbrev) → str

Get the name of the Census region of a given state.

Parameters:	state_abbrev – A two-letter state abbreviation. Example: ‘FL’ or ‘NY’.
Returns:	The name of the state's Census region.
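The relationship between a state abbreviation, its Census region name, and an integer value can be illustrated with a small lookup. This is a sketch only: `STATE_TO_REGION`, `REGION_NAMES`, `region_name`, and `region_index` are hypothetical, cover just five states, and the integer ordering is an assumption; the values that ReverseGeocode actually returns are not specified in this documentation.

```python
# Hypothetical lookup for a handful of states; not the table used by twgeo.
STATE_TO_REGION = {
    "CA": "West",
    "FL": "South",
    "IL": "Midwest",
    "NY": "Northeast",
    "TX": "South",
}

# The four US Census regions. The ordering (and so the integer values)
# is an assumption made for this sketch.
REGION_NAMES = ["Northeast", "Midwest", "South", "West"]


def region_name(state_abbrev):
    """Return the Census region name for a two-letter state abbreviation."""
    return STATE_TO_REGION[state_abbrev.upper()]


def region_index(state_abbrev):
    """Return an integer label for the state's Census region."""
    return REGION_NAMES.index(region_name(state_abbrev))


print(region_name("FL"), region_index("FL"))
print(region_name("NY"), region_index("NY"))
```

Integer labels of this kind are convenient as classification targets, while the region name is the human-readable form of the same label.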

reverse_geocode_state(location) → str

Find the corresponding US state of a given pair of coordinates.

Parameters:	location – A tuple containing (latitude, longitude).
Returns:	The corresponding state abbreviation. Example: ‘FL’ or ‘NY’.

twgeo.data.twus_dataset

Built-in dataset of ~450K US-based users.

twgeo.data.twus_dataset.load_region_data(size='large')

Training samples labeled with the corresponding US Census Region.

Parameters:	size – One of ‘micro’ (10,000 samples), ‘small’ (50,000 samples), ‘mid’ (100,000 samples), or ‘large’ (410,000 samples).
Returns:	Tuple(x_train, y_train, x_dev, y_dev, x_test, y_test)

twgeo.data.twus_dataset.load_state_data(size='large')

Training samples labeled with the corresponding US State.

Parameters:	size – One of ‘micro’ (10,000 samples), ‘small’ (50,000 samples), ‘mid’ (100,000 samples), or ‘large’ (410,000 samples).
Returns:	Tuple(x_train, y_train, x_dev, y_dev, x_test, y_test)
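The size option shared by both loaders can be summarized as a simple mapping from name to sample count. The sketch below is a hypothetical helper, not part of twgeo: `SAMPLE_COUNTS` and `sample_count` are illustrative names, and raising ValueError for an unknown size is an assumption about reasonable behavior rather than documented behavior.

```python
# Sample counts as listed in the documentation for the size parameter.
SAMPLE_COUNTS = {
    "micro": 10_000,
    "small": 50_000,
    "mid": 100_000,
    "large": 410_000,
}


def sample_count(size="large"):
    """Return the number of samples for a named dataset size."""
    try:
        return SAMPLE_COUNTS[size]
    except KeyError:
        valid = ", ".join(sorted(SAMPLE_COUNTS))
        raise ValueError(f"unknown size {size!r}; expected one of: {valid}") from None


print(sample_count())
print(sample_count("micro"))
```

Starting with a smaller size is a cheap way to check a training pipeline end to end before loading the full set.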
