Base Station Positions

This example uses the Telecom Dataset, provided by Shanghai Telecom, to predict the optimal positions for base radio stations. This dataset contains more than 7.2 million records about people's Internet access through 3,233 base stations from 9,481 mobile phones over period of six months. The dataset can be found here. It consists of:

  • user_id : User's ID
  • start_time : When the record begins
  • end_time : When the record ends
  • latitude : Latitude of the base station
  • longitude : Longitude of the base station

To complement the study, we'll also use the shanghai_districts dataset, which contains information on Shanghai's districts. Some of the columns include:

  • name : Name of the district
  • division_code : Division code of the district
  • area : Area of the district in square kilometers
  • population : Population of the district
  • density : Density of the district
  • geometry : Polygon of type 'Geometry' that contains the coordinates of the district

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's load the two datasets.

In [7]:
# Creating the schema
vp.drop("shanghai", method="schema")
vp.create_schema("shanghai")

# Libraries import
import matplotlib
from verticapy.geo import *
from verticapy.datasets import load_world

# Increasing video limit
matplotlib.rcParams['animation.embed_limit'] = 2**128

#######
# CDR #
#######
try:
    cdr = vp.read_csv("data/cdr.csv", 
                      schema="shanghai", 
                      table_name="cdr", 
                      sep=',',
                      parse_nrows=1000,)
except:
    cdr = vp.vDataFrame("shanghai.cdr")
# Unique Row id: It will be used to compute the Polygons intersection
cdr["row_id"] = "ROW_NUMBER() OVER(ORDER BY user_id, start_time)"

######################
# Shanghai Districts #
######################
try:
    shanghai_districts = vp.read_csv("shanghai/districts.csv", 
                                     schema="shanghai", 
                                     table_name="districts", 
                                     sep=',',)
except:
    shanghai_districts = vp.vDataFrame("shanghai.districts")
# Converting the districts to Geometry
shanghai_districts = shanghai_districts["geometry"].apply("ST_GeomFromText({})")
# Creating Shanghai 
shanghai_districts["district_level"] = st.case_when(shanghai_districts["number"] <= 7, 'Downtown',
                                                    shanghai_districts["number"] <= 11, 'Suburb1',
                                                    shanghai_districts["number"] <= 15, 'Suburb2',
                                                    'Suburb3')

These datasets contain the following:

In [5]:
cdr
Out[5]:
123
Uuid
📅
start_time
Timestamp
📅
end_time
Timestamp
🌎
latitude
Numeric(11,7)
🌎
longitude
Numeric(12,7)
123
row_id
Integer
12014-06-19 09:39:002014-06-19 09:39:0031.201279121.7094851
22014-07-24 08:54:002014-07-24 08:54:0031.201279121.7094852
32014-08-25 10:53:002014-08-25 10:54:0031.201279121.7094853
42014-09-22 12:42:002014-09-22 12:43:0031.201279121.7094854
52014-10-20 16:20:002014-10-20 16:20:0031.201279121.7094855
62014-11-26 09:31:002014-11-26 09:32:0031.201279121.7094856
72014-06-17 19:33:002014-06-17 20:12:0031.266813121.5126587
82014-07-02 19:14:002014-07-02 19:24:0031.266813121.5126588
92014-07-02 19:25:002014-07-02 19:35:0031.266813121.5126589
102014-07-02 19:36:002014-07-02 19:41:0031.266813121.51265810
112014-07-02 19:42:002014-07-02 19:47:0031.266813121.51265811
122014-08-05 09:42:002014-08-05 09:50:0031.266813121.51265812
132014-08-05 09:51:002014-08-05 10:09:0031.266813121.51265813
142014-08-05 10:10:002014-08-05 10:24:0031.266813121.51265814
152014-09-08 10:30:002014-09-08 10:52:0031.266813121.51265815
162014-10-01 20:57:002014-10-01 21:14:0031.266813121.51265816
172014-10-01 21:15:002014-10-01 21:29:0031.266813121.51265817
182014-11-03 21:44:002014-11-03 22:00:0031.266813121.51265818
192014-11-03 22:00:002014-11-03 22:11:0031.266813121.51265819
202014-11-03 22:11:002014-11-03 22:14:0031.266813121.51265820
212014-07-24 12:46:002014-07-24 12:48:0031.296226121.49400521
222014-07-24 12:49:002014-07-24 12:51:0031.296226121.49400522
232014-07-24 12:52:002014-07-24 12:55:0031.296226121.49400523
242014-07-24 12:56:002014-07-24 13:03:0031.296226121.49400524
252014-09-24 18:11:002014-09-24 18:17:0031.294393121.49729825
262014-09-24 18:21:002014-09-24 18:26:0031.294393121.49729826
272014-10-19 10:27:002014-10-19 10:30:0031.436148121.22441627
282014-11-16 17:36:002014-11-16 17:39:0031.436148121.22441628
292014-06-01 02:37:002014-06-01 03:00:0031.179835121.42950229
302014-06-01 03:01:002014-06-01 03:01:0031.172757121.42596530
312014-06-01 10:21:002014-06-01 11:02:0031.176998121.42358631
322014-06-01 11:18:002014-06-01 11:43:0031.179835121.42950232
332014-06-02 20:23:002014-06-02 20:24:0031.17605121.43122533
342014-06-04 23:01:002014-06-04 23:17:0031.179835121.42950234
352014-06-05 03:33:002014-06-05 03:34:0031.176998121.42358635
362014-06-05 13:21:002014-06-05 13:25:0031.176998121.42358636
372014-06-06 14:01:002014-06-06 14:04:0031.179835121.42950237
382014-06-13 09:39:002014-06-13 09:49:0031.180175121.42230338
392014-06-14 15:54:002014-06-14 15:55:0031.176998121.42358639
402014-06-14 15:55:002014-06-14 15:56:0031.176998121.42358640
412014-06-14 18:51:002014-06-14 19:09:0031.172757121.42596541
422014-06-15 16:26:002014-06-15 16:41:0031.176813121.43911442
432014-06-16 12:20:002014-06-16 13:52:0031.176998121.42358643
442014-06-16 15:42:002014-06-16 15:44:0031.120864121.38914144
452014-06-17 17:34:002014-06-17 19:17:0031.179835121.42950245
462014-06-17 19:29:002014-06-17 19:32:0031.176998121.42358646
472014-06-17 20:15:002014-06-17 20:26:0031.172757121.42596547
482014-06-17 21:37:002014-06-17 22:13:0031.179835121.42950248
492014-06-18 04:29:002014-06-18 04:38:0031.176998121.42358649
502014-06-18 05:26:002014-06-18 05:29:0031.176998121.42358650
512014-06-18 15:00:002014-06-18 17:58:0031.179835121.42950251
522014-06-18 18:02:002014-06-18 21:02:0031.179835121.42950252
532014-06-18 21:12:002014-06-18 21:30:0031.176998121.42358653
542014-06-20 10:56:002014-06-20 11:10:0031.179835121.42950254
552014-06-20 11:13:002014-06-20 14:09:0031.176998121.42358655
562014-06-20 16:25:002014-06-20 16:39:0031.157677121.42453156
572014-06-20 21:46:002014-06-20 21:47:0031.179835121.42950257
582014-06-21 14:42:002014-06-21 15:08:0031.176998121.42358658
592014-06-22 22:19:002014-06-22 22:22:0031.180175121.42230359
602014-06-22 22:44:002014-06-22 22:52:0031.176998121.42358660
612014-06-22 22:49:002014-06-22 22:50:0031.176998121.42358661
622014-06-23 07:54:002014-06-23 07:56:0031.176998121.42358662
632014-06-24 11:18:002014-06-24 14:18:0031.172757121.42596563
642014-06-24 14:27:002014-06-24 17:26:0031.179835121.42950264
652014-06-24 20:36:002014-06-24 21:19:0031.179835121.42950265
662014-06-24 21:21:002014-06-24 22:57:0031.176998121.42358666
672014-06-25 00:22:002014-06-25 03:18:0031.172757121.42596567
682014-06-25 06:28:002014-06-25 06:59:0031.176998121.42358668
692014-06-25 07:04:002014-06-25 08:16:0031.176813121.43911469
702014-06-25 08:22:002014-06-25 10:58:0031.176998121.42358670
712014-06-25 14:46:002014-06-25 15:46:0031.176998121.42358671
722014-06-25 18:55:002014-06-25 19:40:0031.176998121.42358672
732014-06-25 19:41:002014-06-25 22:41:0031.176998121.42358673
742014-06-25 22:45:002014-06-25 22:58:0031.176998121.42358674
752014-06-25 23:03:002014-06-26 01:42:0031.176998121.42358675
762014-06-26 07:02:002014-06-26 07:43:0031.172757121.42596576
772014-06-26 07:47:002014-06-26 09:47:0031.176998121.42358677
782014-06-26 16:10:002014-06-26 16:56:0031.176998121.42358678
792014-06-26 16:57:002014-06-26 16:58:0031.176998121.42358679
802014-06-26 21:46:002014-06-26 22:15:0031.176998121.42358680
812014-06-26 23:14:002014-06-26 23:20:0031.172757121.42596581
822014-06-27 07:15:002014-06-27 07:22:0031.176998121.42358682
832014-06-27 14:39:002014-06-27 17:36:0031.180175121.42230383
842014-06-27 17:41:002014-06-27 19:54:0031.176998121.42358684
852014-06-27 22:05:002014-06-27 22:31:0031.176998121.42358685
862014-06-27 22:31:002014-06-27 22:52:0031.179835121.42950286
872014-06-27 23:16:002014-06-27 23:24:0031.176998121.42358687
882014-06-30 08:20:002014-06-30 09:04:0031.166365121.41979288
892014-06-30 09:36:002014-06-30 09:42:0031.166365121.41979289
902014-06-30 12:37:002014-06-30 15:35:0031.172757121.42596590
912014-06-30 15:39:002014-06-30 16:36:0031.176998121.42358691
922014-06-30 19:26:002014-06-30 21:59:0031.180175121.42230392
932014-06-30 22:48:002014-06-30 22:57:0031.172757121.42596593
942014-07-01 06:36:002014-07-01 06:58:0031.180175121.42230394
952014-07-01 09:34:002014-07-01 09:41:0031.171686121.44122195
962014-07-01 09:46:002014-07-01 10:00:0031.176813121.43911496
972014-07-02 12:56:002014-07-02 12:59:0031.138371121.41697597
982014-07-02 13:46:002014-07-02 14:00:0031.127837121.38065498
992014-07-03 14:55:002014-07-03 15:31:0031.194194121.45658999
1002014-07-04 10:51:002014-07-04 13:50:0031.179835121.429502100
Rows: 1-100 | Columns: 6
In [6]:
shanghai_districts
Out[6]:
123
number
Int
123
division_code
Int
Abc
code
Varchar(20)
Abc
name
Varchar(36)
Abc
local_name
Varchar(24)
123
area
Numeric(9,3)
123
population
Int
123
density
Int
🌎
Long varbinary(1040064)
Abc
district_level
Varchar(8)
11310101HGPHuangpu District黄浦区20.4665380031955Downtown
22310104XHIXuhui District徐汇区54.76108440019803Downtown
33310105CNQChangning District长宁区38.369400018120Downtown
44310106JAQJing'an District静安区36.88106280028818Downtown
55310107PTQPutuo District普陀区54.83128190023380Downtown
66310109HKQHongkou District虹口区23.4879700033944Downtown
77310110YPUYangpu District杨浦区60.73131270021615Downtown
88310115PDXPudong浦东新区1210.4155502004585Suburb1
99310112MHQMinhang District闵行区370.752543500686Suburb1
1010310113BAOBaoshan District宝山区270.9920423007536Suburb1
1111310114JDGJiading District嘉定区464.215889003423Suburb1
1212310116JSHJinshan District金山区586.058050001374Suburb2
1313310117SOJSongjiang District松江区605.6417622002910Suburb2
1414310118QPUQingpu District青浦区670.1412191001819Suburb2
1515310120FXIFengxian District奉贤区687.3911520001676Suburb2
1616310151CMGChongming District崇明区1185.49688100580Suburb3
Rows: 1-16 | Columns: 10

Data Exploration

Detecting outliers

Since we're only concerned about the base stations in Shanghai, let's begin by finding the global outliers in our our Shanghai Telecom dataset, 'cdr.' First, we load the "World" dataset, a predefined dataset in VerticaPy, and then plot on a map of China to see if any points fall outside of Shanghai. We can then drop these outliers using the z-score method.

In [8]:
# Outliers
world = load_world()
china = world[world["country"] == "China"]
ax = china["geometry"].geo_plot(color = "white",
                                edgecolor='black',)
cdr.groupby(["longitude", "latitude"]).scatter(["longitude", "latitude"], ax=ax)
ax.set_title("Shanghai's Base Stations with Outliers")

# Dropping Outliers
cdr["longitude"].drop_outliers(threshold=2.0)
cdr["latitude"].drop_outliers(threshold=2.0)

ax = china["geometry"].geo_plot(color = "white",
                                edgecolor='black',)
cdr.groupby(["longitude", "latitude"]).scatter(["longitude", "latitude"], ax=ax)
ax.set_title("Shanghai's Base Stations without Outliers")
72490 elements were filtered.
113578 elements were filtered.
Out[8]:
Text(0.5, 1.0, "Shanghai's Base Stations without Outliers")

As we can see from the second plot, we've discarded the base stations outside of Shanghai.

Understanding Shanghai's Districts

Let's check the districts on the map. The Huangpu district is the urban "hub" of sorts and the most central of Shanghai's districts, so we'll pay it some special attention. We'll be referring to the Huangpu district as Shanghai's "downtown" in this study.

In [9]:
ax = shanghai_districts["geometry"].geo_plot(column = "district_level",
                                             edgecolor='white',)
# Finding Centroids
centroids = shanghai_districts.select(["name", 
                                       "ST_X(ST_CENTROID(geometry))", 
                                       "ST_Y(ST_CENTROID(geometry))"]).to_list()
# Plotting the suburb names
for c in centroids[7:]:
    ax.text(c[1], c[2], c[0], va="center", ha="center")
ax.set_title("Shanghai's Districts")
ax.text(121.43, 31.25, 'Downtown', va="center", ha="center")
ax2 = shanghai_districts.search("number <= 7")["geometry"].geo_plot(color="#CCCCCC",
                                                                    edgecolor='white',)
# Plotting the downtown names
for c in centroids[:7]:
    ax2.text(c[1], c[2], c[0], va="center", ha="center")
ax2.set_title("Shanghai's Downtown")
ax2
Out[9]:
<AxesSubplot:title={'center':"Shanghai's Downtown"}>

Districts' Activity

Let's examine the network activity of each of our districts. To do this, we need VerticaPy's Geospatial, which leverages Vertica geospatial functions. We begin by creating an index for the districts and then find the intersection between connections and districts. We'll visualize this with a bar chart race, which reflects each district's cumulative activity duration through time.

In [10]:
# Finding the intersections between each ping and each district
create_index(shanghai_districts, 
             gid="number", 
             g="geometry", 
             index="shanghai_districts",
             overwrite=True,)
intersect_districts_cdr = intersect(cdr,
                                    index="shanghai_districts",
                                    gid="row_id",
                                    x="longitude",
                                    y="latitude",)
intersect_districts_cdr
Out[10]: