This is a report on how annual members and casual riders use
Cyclistic bikes differently. With the goal of converting casual riders
into annual members, Cyclistic first wants to understand the
differences in how the two groups ride.
This insight into behaviour can help us figure out why casual riders
might want to convert to an annual membership, or if there are changes
we could make to the membership to encourage that transition.
The last 12 months of data available (January - December 2023) were downloaded from Cyclistic/Divvy’s data bucket on January 27th 2024. Each month’s data was unzipped using WinRAR and stored locally in a project file structure. This data is collected automatically by Cyclistic bikes and docking stations and should include all rides with no bias. The data has been made available by Motivate International Inc. (or Lyft Bikes and Scooters, LLC) under this license.
I want to use R for my data analysis and processing, so I’ll be importing the data into R and using some tools to clean and process the data.
First we’ll install (if needed)…
if(!require(tidyverse)) install.packages("tidyverse",repos = "http://cran.us.r-project.org")
## Loading required package: tidyverse
…and load the tidyverse default packages plus lubridate for working with date-time data.
library(tidyverse)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Now let’s import each CSV into an R data frame.
jan_2023 <- read.csv("Data/202301-divvy-tripdata.csv")
feb_2023 <- read.csv("Data/202302-divvy-tripdata.csv")
mar_2023 <- read.csv("Data/202303-divvy-tripdata.csv")
apr_2023 <- read.csv("Data/202304-divvy-tripdata.csv")
may_2023 <- read.csv("Data/202305-divvy-tripdata.csv")
jun_2023 <- read.csv("Data/202306-divvy-tripdata.csv")
jul_2023 <- read.csv("Data/202307-divvy-tripdata.csv")
aug_2023 <- read.csv("Data/202308-divvy-tripdata.csv")
sep_2023 <- read.csv("Data/202309-divvy-tripdata.csv")
oct_2023 <- read.csv("Data/202310-divvy-tripdata.csv")
nov_2023 <- read.csv("Data/202311-divvy-tripdata.csv")
dec_2023 <- read.csv("Data/202312-divvy-tripdata.csv")
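The twelve read.csv() calls work, but for reference the import can be sketched more compactly by listing the files and binding them in one step (this assumes the Data/ folder contains only these monthly CSVs; read_trip_year is a hypothetical helper name):

```r
# Read every monthly trip-data CSV in a folder and row-bind into one data frame.
read_trip_year <- function(dir) {
  files <- list.files(dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv))
}
```

This removes the per-month variables, so later code that references jan_2023, dec_2023, etc. would need adjusting; the per-month frames are kept in this report for the glimpse() checks below.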
Next we can check out some features of our data - the columns, the data types, and the first few entries in each column.
colnames(jan_2023) #tells us the names of the columns
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
glimpse(jan_2023) #shows us the columns and the first entries that fit on screen
## Rows: 190,301
## Columns: 13
## $ ride_id <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
## $ rideable_type <chr> "electric_bike", "classic_bike", "electric_bike", "…
## $ started_at <chr> "2023-01-21 20:05:42", "2023-01-10 15:37:36", "2023…
## $ ended_at <chr> "2023-01-21 20:16:33", "2023-01-10 15:46:05", "2023…
## $ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
## $ start_station_id <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
## $ end_station_name <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
## $ end_station_id <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
## $ start_lat <dbl> 41.92407, 41.79957, 42.00857, 41.79957, 41.79957, 4…
## $ start_lng <dbl> -87.64628, -87.59475, -87.69048, -87.59475, -87.594…
## $ end_lat <dbl> 41.93000, 41.80983, 42.03974, 41.80983, 41.80983, 4…
## $ end_lng <dbl> -87.64000, -87.59938, -87.69941, -87.59938, -87.599…
## $ member_casual <chr> "member", "member", "casual", "member", "member", "…
glimpse(dec_2023) #Compare to January to make sure we have the same columns and data types
## Rows: 224,073
## Columns: 13
## $ ride_id <chr> "C9BD54F578F57246", "CDBD92F067FA620E", "ABC0858E52…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <chr> "2023-12-02 18:44:01", "2023-12-02 18:48:19", "2023…
## $ ended_at <chr> "2023-12-02 18:47:51", "2023-12-02 18:54:48", "2023…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat <dbl> 41.92, 41.92, 41.89, 41.95, 41.92, 41.91, 41.99, 42…
## $ start_lng <dbl> -87.66, -87.66, -87.62, -87.65, -87.64, -87.63, -87…
## $ end_lat <dbl> 41.92, 41.89, 41.90, 41.94, 41.93, 41.88, 42.00, 41…
## $ end_lng <dbl> -87.66, -87.64, -87.64, -87.65, -87.64, -87.65, -87…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
#as_tibble(jan_2023) #Shows us as much of the table as will fit on the screen and up to 10 rows of data
Everything looks good and ready to bring together. We’ll get all twelve months of data into a single data frame so we can analyze the whole year at the same time.
all_2023 <- dplyr::bind_rows(jan_2023, feb_2023, mar_2023, apr_2023, may_2023, jun_2023, jul_2023, aug_2023, sep_2023, oct_2023, nov_2023, dec_2023)
We prepare for analysis by adding some columns that will be useful for later aggregation. We will also delete some rides that indicate bikes were taken out of circulation, as these were not taken by actual riders.
First let’s use our Ride Started data to extract the year, month, day, and hour each ride started. We’ll also calculate the day of the week and duration of each ride. Each of these will be stored in its own column.
#all_2023$date <- as.Date(all_2023$started_at)
#all_2023$year <- format(as.Date(all_2023$date), "%Y")
#all_2023$month <- format(as.Date(all_2023$date), "%m")
#all_2023$day <- format(as.Date(all_2023$date), "%d")
#all_2023$day_of_week <- format(as.Date(all_2023$date), "%A")
#all_2023$start_hour <- format(as.POSIXct(all_2023$started_at), "%H")
# Oh wait, let's use lubridate instead! Much easier code to read,
# plus we get to practice using the tidyverse more.
# Store the components of the date in their own columns
all_2023$date <- date(all_2023$started_at)
all_2023$year <- year(all_2023$started_at)
all_2023$month <- month(all_2023$started_at)
all_2023$day <- day(all_2023$started_at)
all_2023$start_hour <- hour(all_2023$started_at)
# Calculate and store the weekday of each ride
all_2023$day_of_week <- wday(all_2023$started_at)
# Calculate and store the duration of each ride...
all_2023$ride_length <- difftime(all_2023$ended_at, all_2023$started_at)
# ... and convert that duration to a number
all_2023$ride_length <- as.numeric(all_2023$ride_length)
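One side note: started_at is still stored as text here, so each lubridate accessor re-parses it. A sketch of an alternative (assuming the column layout shown above, with dplyr and lubridate loaded) parses the timestamps once with ymd_hms() and derives everything from the parsed columns:

```r
# Parse the timestamps once, then derive all date parts from the parsed values.
all_2023 <- all_2023 %>%
  mutate(
    started_at  = ymd_hms(started_at),
    ended_at    = ymd_hms(ended_at),
    date        = as_date(started_at),
    year        = year(started_at),
    month       = month(started_at),
    day         = day(started_at),
    start_hour  = hour(started_at),
    day_of_week = wday(started_at),
    # units = "secs" makes the duration unambiguous before converting to numeric
    ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))
  )
```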
Next we’ll remove some “bad” data: bikes taken out of docks for quality checks, and rides with a negative length. We’ll store the cleaned data in a new frame.
all_2023_v2 <- all_2023[!(all_2023$start_station_name == "HQ QR" | all_2023$ride_length<0),]
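For reference, the same cleaning step can be sketched with dplyr’s filter(). One subtle difference worth knowing: filter() drops rows where the condition evaluates to NA, while the base-R subsetting above keeps such rows as all-NA rows.

```r
# Keep rides that are not quality checks and have a non-negative length.
all_2023_v2 <- all_2023 %>%
  filter(start_station_name != "HQ QR", ride_length >= 0)
```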
We could consider cutting out some of the extremely long rides, but
without knowing the circumstances for them, this would be irresponsible.
Given the number of rides we have data for, we hope they will not skew
our averages much. Maybe we could repeat this analysis later with rides
over 24 hours removed or something similar.
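If we do revisit that idea, the cutoff could be sketched as a small helper (the 24-hour threshold is an assumption for illustration, not something the data dictates):

```r
# Keep only rides at or under a maximum length in seconds (default: 24 hours).
drop_long_rides <- function(df, max_secs = 24 * 60 * 60) {
  df[df$ride_length <= max_secs, , drop = FALSE]
}
```

Comparing summary statistics before and after applying it would show how much the extreme rides move the mean.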
There are also lots of rides missing start and end stations. This could
be for a number of reasons we may not be privy to in how Divvy operates,
so for now we will take all of the data into account for our next
step.
Now that we have a unified, cleaned, and organized data set, let’s take a look at what information we can pull out of that data.
We’ll start with a quick summary of ride lengths: the mean, the quartiles, and the minimum and maximum. Remember that these values are in seconds.
summary(all_2023_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 325 572 1091 1015 5909344
Next we’ll break this down by casual riders and members, working towards our goal. The default output format is a little hard to read, though.
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = mean)
## all_2023_v2$member_casual all_2023_v2$ride_length
## 1 casual 1694.9597
## 2 member 751.5815
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = median)
## all_2023_v2$member_casual all_2023_v2$ride_length
## 1 casual 711
## 2 member 511
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = min)
## all_2023_v2$member_casual all_2023_v2$ride_length
## 1 casual 0
## 2 member 0
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = max)
## all_2023_v2$member_casual all_2023_v2$ride_length
## 1 casual 5909344
## 2 member 90031
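The four separate aggregate() calls can also be collapsed into one call that returns all four statistics side by side (a sketch using the same base-R aggregate(); ride_stats is a hypothetical helper name):

```r
# Compute mean, median, min, and max of ride_length per rider type in one call.
ride_stats <- function(df) {
  aggregate(ride_length ~ member_casual, data = df,
            FUN = function(x) c(mean = mean(x), median = median(x),
                                min = min(x), max = max(x)))
}
```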
We can also look at the ridership on each day. Note that 1 = Sunday, 2 = Monday, …, 7 = Saturday.
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual + all_2023_v2$day_of_week, FUN = mean)
## all_2023_v2$member_casual all_2023_v2$day_of_week all_2023_v2$ride_length
## 1 casual 1 1972.1743
## 2 member 1 839.6244
## 3 casual 2 1662.8862
## 4 member 2 714.1429
## 5 casual 3 1505.0795
## 6 member 3 720.7687
## 7 casual 4 1458.2698
## 8 member 4 716.8575
## 9 casual 5 1483.5869
## 10 member 5 721.1883
## 11 casual 6 1635.7334
## 12 member 6 748.7234
## 13 casual 7 1928.4038
## 14 member 7 836.3739
Let’s glance at how ridership varies over the year, broken down by month.
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual + all_2023_v2$month, FUN = mean)
## all_2023_v2$member_casual all_2023_v2$month all_2023_v2$ride_length
## 1 casual 1 1374.8904
## 2 member 1 621.7059
## 3 casual 2 1391.5510
## 4 member 2 642.8619
## 5 casual 3 1283.4632
## 6 member 3 626.3867
## 7 casual 4 1660.3767
## 8 member 4 701.6340
## 9 casual 5 1711.2119
## 10 member 5 782.4131
## 11 casual 6 1764.4163
## 12 member 6 792.0162
## 13 casual 7 1940.0704
## 14 member 7 821.4992
## 15 casual 8 2114.8298
## 16 member 8 826.3609
## 17 casual 9 1511.2266
## 18 member 9 788.7250
## 19 casual 10 1372.3235
## 20 member 10 729.0645
## 21 casual 11 1196.1806
## 22 member 11 694.9790
## 23 casual 12 1196.2222
## 24 member 12 686.7966
So we can generally see that casual users took longer rides than members every day of the week. Now let’s also look at the number of rides each group took.
all_2023_v2 %>%
group_by(member_casual, day_of_week) %>%
summarize(number_of_rides=n(), average_duration = mean(ride_length)) %>%
arrange(member_casual,day_of_week)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 Ă— 4
## # Groups: member_casual [2]
## member_casual day_of_week number_of_rides average_duration
## <chr> <dbl> <int> <dbl>
## 1 casual 1 335668 1972.
## 2 casual 2 234818 1663.
## 3 casual 3 246211 1505.
## 4 casual 4 249153 1458.
## 5 casual 5 270596 1484.
## 6 casual 6 311907 1636.
## 7 casual 7 410684 1928.
## 8 member 1 408829 840.
## 9 member 2 494558 714.
## 10 member 3 576743 721.
## 11 member 4 586438 717.
## 12 member 5 589572 721.
## 13 member 6 531582 749.
## 14 member 7 472846 836.
# Let's store that output for later use
summary_2023 <- all_2023_v2 %>%
group_by(member_casual, day_of_week) %>%
summarize(number_of_rides=n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
From the resulting table we can see members took a lot more rides every day of the week. Let’s get some visuals for number of rides and average ride duration.
ggplot(data = summary_2023, mapping = aes(x = day_of_week, y = number_of_rides)) +
geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
#geom_label(mapping = aes(label = number_of_rides)) +
labs(title = "Number of Rides per Weekday", x = "Day of the Week", y = "Number of Rides", fill = "Rider Type", alt = "A graph with columns showing the number of rides per weekday. Each weekday has one column for members and another for casual riders." ) +
annotate("text", x = 1, y = -2000, label = "Sun" ) +
annotate("text", x = 2, y = -2000, label = "Mon" ) +
annotate("text", x = 3, y = -2000, label = "Tues" ) +
annotate("text", x = 4, y = -2000, label = "Wed" ) +
annotate("text", x = 5, y = -2000, label = "Thurs" ) +
annotate("text", x = 6, y = -2000, label = "Fri" ) +
annotate("text", x = 7, y = -2000, label = "Sat" )
ggplot(data = summary_2023, mapping = aes(x = day_of_week, y = average_duration)) +
geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
labs(title = "Average length of ride per Weekday", x = "Day of the Week", y = "Average Duration of Ride (in seconds)", fill = "Rider Type", alt = "A graph with columns showing the average duration of a ride per weekday. Each weekday has one column for members and another for casual riders." ) +
annotate("text", x = 1, y = -100, label = "Sun" ) +
annotate("text", x = 2, y = -100, label = "Mon" ) +
annotate("text", x = 3, y = -100, label = "Tues" ) +
annotate("text", x = 4, y = -100, label = "Wed" ) +
annotate("text", x = 5, y = -100, label = "Thurs" ) +
annotate("text", x = 6, y = -100, label = "Fri" ) +
annotate("text", x = 7, y = -100, label = "Sat" )
Now let’s take a look at when riders are using the bikes by summarizing based on the hour each ride starts.
starting_hour_summary <- all_2023_v2 %>%
group_by(member_casual, day_of_week, start_hour) %>%
summarize(number_of_rides=n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week, start_hour)
## `summarise()` has grouped output by 'member_casual', 'day_of_week'. You can
## override using the `.groups` argument.
With rides starting each hour combined on a single graph, we can see the commute spikes, especially for members.
ggplot(data = starting_hour_summary, mapping = aes(x = start_hour, y = number_of_rides)) +
geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
#geom_label(mapping = aes(label = number_of_rides)) +
labs(title = "Number of Rides starting each hour", x = "Hour", y = "Number of Rides", fill = "Rider Type", alt = "A graph with columns showing the number of rides starting each hour of the day. Each hour has one column for members and another for casual riders." )
Here is how the number of rides breaks down by starting hour on each day of the week. Remember that day 1 is Sunday.
ggplot(data = starting_hour_summary, mapping = aes(x = start_hour, y = number_of_rides)) +
geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
#geom_label(mapping = aes(label = number_of_rides)) +
labs(title = "Number of Rides starting each hour", x = "Hour", y = "Number of Rides", fill = "Rider Type", alt = "A collections of graphs with columns showing the number of starting each hour of the day, one graph for each weekday. Each hour has one column for members and another for casual riders." ) +
facet_wrap(~day_of_week)
We can also look at how the length of a ride varies with the hour it started.
ggplot(data = starting_hour_summary, mapping = aes(x = start_hour, y = average_duration)) +
geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
#geom_label(mapping = aes(label = number_of_rides)) +
labs(title = "Average duration of rides starting each hour", x = "Hour", y = "Average duration (in seconds)", fill = "Rider Type", alt = "A graph with columns showing the average duraion of rides starting each hour of the day. Each hour has one column for members and another for casual riders." )
Honestly, this makes me suspect that casual riders who use a bike to get home after a night out fail to close out their ride properly, which skews our average data.
We see that members take far more rides than casual riders (between
two and three times as many), and casual riders take longer rides on
average (a figure that may be inflated by some extremely long outliers,
though the median supports the same general trend).
When we break it down by day, we can see that on weekends casual
riders take more rides and members take fewer, bringing their numbers
close together, and both groups take longer rides on average than on
weekdays. When we break rides down by starting hour, the biggest
difference is that member rides spike around commuting hours.
There is a definite possibility that casual riders who take bikes home
after a late night out do not park them correctly and thus skew the
average ride length with outliers that do not reflect how long the bike
was actually ridden (rides that start midnight - 4am).
What bike types did riders use? Are there locations where casual rides tend to start/end?
All the benefits of owning a bike (show: commuting, exercising outside in beautiful weather, meeting friends without needing to park; could segue into the downsides of cars: parking, gas, the check engine light, pollution…) with none of the hassle (show: bike storage, oiling the chain, bike theft, mechanic?)