Base Station Positions¶
This example uses the Telecom Dataset, provided by Shanghai Telecom, to predict the optimal positions for base radio stations. This dataset contains more than 7.2 million records about people's Internet access through 3,233 base stations from 9,481 mobile phones over period of six months. The dataset can be found here. It consists of:
- user_id : User's ID
- start_time : When the record begins
- end_time : When the record ends
- latitude : Latitude of the base station
- longitude : Longitude of the base station
To complement the study, we'll also use the shanghai_districts dataset, which contains information on Shanghai's districts. Some of the columns include:
- name : Name of the district
- division_code : Division code of the district
- area : Area of the district in square kilometers
- population : Population of the district
- density : Density of the district
- geometry : Polygon of type 'Geometry' that contains the coordinates of the district
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.
vp.connect("VerticaDSN")
Let's load the two datasets.
# Creating the schema
vp.drop("shanghai", method="schema")
vp.create_schema("shanghai")
# Libraries import
import matplotlib
from verticapy.geo import *
from verticapy.datasets import load_world
# Increasing video limit
matplotlib.rcParams['animation.embed_limit'] = 2**128
#######
# CDR #
#######
try:
cdr = vp.read_csv("data/cdr.csv",
schema="shanghai",
table_name="cdr",
sep=',',
parse_nrows=1000,)
except:
cdr = vp.vDataFrame("shanghai.cdr")
# Unique Row id: It will be used to compute the Polygons intersection
cdr["row_id"] = "ROW_NUMBER() OVER(ORDER BY user_id, start_time)"
######################
# Shanghai Districts #
######################
try:
shanghai_districts = vp.read_csv("shanghai/districts.csv",
schema="shanghai",
table_name="districts",
sep=',',)
except:
shanghai_districts = vp.vDataFrame("shanghai.districts")
# Converting the districts to Geometry
shanghai_districts = shanghai_districts["geometry"].apply("ST_GeomFromText({})")
# Creating Shanghai
shanghai_districts["district_level"] = st.case_when(shanghai_districts["number"] <= 7, 'Downtown',
shanghai_districts["number"] <= 11, 'Suburb1',
shanghai_districts["number"] <= 15, 'Suburb2',
'Suburb3')
These datasets contain the following:
cdr
shanghai_districts
Data Exploration¶
Detecting outliers¶
Since we're only concerned about the base stations in Shanghai, let's begin by finding the global outliers in our our Shanghai Telecom dataset, 'cdr.' First, we load the "World" dataset, a predefined dataset in VerticaPy, and then plot on a map of China to see if any points fall outside of Shanghai. We can then drop these outliers using the z-score method.
# Outliers
world = load_world()
china = world[world["country"] == "China"]
ax = china["geometry"].geo_plot(color = "white",
edgecolor='black',)
cdr.groupby(["longitude", "latitude"]).scatter(["longitude", "latitude"], ax=ax)
ax.set_title("Shanghai's Base Stations with Outliers")
# Dropping Outliers
cdr["longitude"].drop_outliers(threshold=2.0)
cdr["latitude"].drop_outliers(threshold=2.0)
ax = china["geometry"].geo_plot(color = "white",
edgecolor='black',)
cdr.groupby(["longitude", "latitude"]).scatter(["longitude", "latitude"], ax=ax)
ax.set_title("Shanghai's Base Stations without Outliers")
As we can see from the second plot, we've discarded the base stations outside of Shanghai.
Understanding Shanghai's Districts¶
Let's check the districts on the map. The Huangpu district is the urban "hub" of sorts and the most central of Shanghai's districts, so we'll pay it some special attention. We'll be referring to the Huangpu district as Shanghai's "downtown" in this study.
ax = shanghai_districts["geometry"].geo_plot(column = "district_level",
edgecolor='white',)
# Finding Centroids
centroids = shanghai_districts.select(["name",
"ST_X(ST_CENTROID(geometry))",
"ST_Y(ST_CENTROID(geometry))"]).to_list()
# Plotting the suburb names
for c in centroids[7:]:
ax.text(c[1], c[2], c[0], va="center", ha="center")
ax.set_title("Shanghai's Districts")
ax.text(121.43, 31.25, 'Downtown', va="center", ha="center")
ax2 = shanghai_districts.search("number <= 7")["geometry"].geo_plot(color="#CCCCCC",
edgecolor='white',)
# Plotting the downtown names
for c in centroids[:7]:
ax2.text(c[1], c[2], c[0], va="center", ha="center")
ax2.set_title("Shanghai's Downtown")
ax2
Districts' Activity¶
Let's examine the network activity of each of our districts. To do this, we need VerticaPy's Geospatial, which leverages Vertica geospatial functions. We begin by creating an index for the districts and then find the intersection between connections and districts. We'll visualize this with a bar chart race, which reflects each district's cumulative activity duration through time.
# Finding the intersections between each ping and each district
create_index(shanghai_districts,
gid="number",
g="geometry",
index="shanghai_districts",
overwrite=True,)
intersect_districts_cdr = intersect(cdr,
index="shanghai_districts",
gid="row_id",
x="longitude",
y="latitude",)
intersect_districts_cdr