Cyclistic Membership Drive Report

How can Cyclistic convert casual riders into annual members?

Case Study 1A

This is a report on how annual members and casual riders use Cyclistic bikes differently. With the goal of converting casual riders into annual members, Cyclistic first wants to answer how the two types of riders use the bikes differently.
This insight into behaviour can help us figure out why casual riders might want to convert to an annual membership, or if there are changes we could make to the membership to encourage that transition.

Data Sources

The last 12 months of data available (January - December 2023) were downloaded from Cyclistic/Divvy’s data bucket on January 27th 2024. Each month’s data was unzipped locally using WinRAR and stored locally in a file structure for this project. This data is automatically collected by Cyclistic bikes and docking stations and should include all rides on bikes with no bias. This data has been made available by Motivate International Inc. (or Lyft Bikes and Scooters, LLC) under this license.

Data Cleaning

I want to use R for my data analysis and processing, so I’ll be importing the data into R and using some tools to clean and process the data.

Setting up our environment

First we’ll install (if needed)…

if(!require(tidyverse)) install.packages("tidyverse",repos = "http://cran.us.r-project.org")
## Loading required package: tidyverse
## Error: package or namespace load failed for 'tidyverse':
##  .onAttach failed in attachNamespace() for 'tidyverse', details:
##   call: NULL
##   error: package or namespace load failed for 'readr':
##  .onLoad failed in loadNamespace() for 'readr', details:
##   call: loadNamespace(x)
##   error: there is no package called 'tzdb'
## Warning: package 'tidyverse' is in use and will not be installed

…and load the tidyverse default packages plus lubridate for working with date-time data.

library(tidyverse)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Importing Data

Now let’s import each CSV into an R data frame.

jan_2023 <- read.csv("Data/202301-divvy-tripdata.csv")
feb_2023 <- read.csv("Data/202302-divvy-tripdata.csv")
mar_2023 <- read.csv("Data/202303-divvy-tripdata.csv")
apr_2023 <- read.csv("Data/202304-divvy-tripdata.csv")

may_2023 <- read.csv("Data/202305-divvy-tripdata.csv")
jun_2023 <- read.csv("Data/202306-divvy-tripdata.csv")
jul_2023 <- read.csv("Data/202307-divvy-tripdata.csv")
aug_2023 <- read.csv("Data/202308-divvy-tripdata.csv")

sep_2023 <- read.csv("Data/202309-divvy-tripdata.csv")
oct_2023 <- read.csv("Data/202310-divvy-tripdata.csv")
nov_2023 <- read.csv("Data/202311-divvy-tripdata.csv")
dec_2023 <- read.csv("Data/202312-divvy-tripdata.csv")

Next we can check out some features of our data - the columns, the data types, and the first few entries in each column.

colnames(jan_2023) #tells us the names of the columns 
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
glimpse(jan_2023) #shows us the columns and the first entries that fit on screen
## Rows: 190,301
## Columns: 13
## $ ride_id            <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
## $ rideable_type      <chr> "electric_bike", "classic_bike", "electric_bike", "…
## $ started_at         <chr> "2023-01-21 20:05:42", "2023-01-10 15:37:36", "2023…
## $ ended_at           <chr> "2023-01-21 20:16:33", "2023-01-10 15:46:05", "2023…
## $ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
## $ start_station_id   <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
## $ end_station_name   <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
## $ end_station_id     <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
## $ start_lat          <dbl> 41.92407, 41.79957, 42.00857, 41.79957, 41.79957, 4…
## $ start_lng          <dbl> -87.64628, -87.59475, -87.69048, -87.59475, -87.594…
## $ end_lat            <dbl> 41.93000, 41.80983, 42.03974, 41.80983, 41.80983, 4…
## $ end_lng            <dbl> -87.64000, -87.59938, -87.69941, -87.59938, -87.599…
## $ member_casual      <chr> "member", "member", "casual", "member", "member", "…
glimpse(dec_2023) #Compare to January to make sure we have the same columns and data types
## Rows: 224,073
## Columns: 13
## $ ride_id            <chr> "C9BD54F578F57246", "CDBD92F067FA620E", "ABC0858E52…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2023-12-02 18:44:01", "2023-12-02 18:48:19", "2023…
## $ ended_at           <chr> "2023-12-02 18:47:51", "2023-12-02 18:54:48", "2023…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.92, 41.92, 41.89, 41.95, 41.92, 41.91, 41.99, 42…
## $ start_lng          <dbl> -87.66, -87.66, -87.62, -87.65, -87.64, -87.63, -87…
## $ end_lat            <dbl> 41.92, 41.89, 41.90, 41.94, 41.93, 41.88, 42.00, 41…
## $ end_lng            <dbl> -87.66, -87.64, -87.64, -87.65, -87.64, -87.65, -87…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
#as_tibble(jan_2023) #Shows us as much of the table as will fit on the screen and up to 10 rows of data

Everything looks good and ready to bring together. We’ll get all twelve months of data into a single data frame so we can analyze the whole year at the same time.

all_2023 <- dplyr::bind_rows(jan_2023, feb_2023, mar_2023, apr_2023, may_2023, jun_2023, jul_2023, aug_2023, sep_2023, oct_2023, nov_2023, dec_2023)

Clean up the data

We prepare for analysis by adding some additional columns of data that will be useful for later aggregation. We will also delete some rides what indicate bikes were taken out of circulation as these were not taken by actual riders.

First let’s use our Ride Started data to extract the year, month, day, and hour each ride started. We’ll also calculate the day of the week and duration of each ride. Each of these will be stored in its own column.

#all_2023$date <- as.Date(all_2023$started_at)
#all_2023$year <- format(as.Date(all_2023$date), "%Y")
#all_2023$month <- format(as.Date(all_2023$date), "%m")
#all_2023$day <- format(as.Date(all_2023$date), "%d")
#all_2023$day_of_week <- format(as.Date(all_2023$date), "%A")
#all_2023$start_hour <- format(as.POSIXct(all_2023$started_at), "%H")

# Oh wait, let's use lubridate instead!  Much easier code to read,
# plus we get to practice using the tidyverse more.  

# Store the components of the date in their own columns 
all_2023$date <- date(all_2023$started_at)
all_2023$year <- year(all_2023$started_at)
all_2023$month <- month(all_2023$started_at)
all_2023$day <- day(all_2023$started_at)
all_2023$start_hour <- hour(all_2023$started_at)

# Calculate and store the weekday of each ride
all_2023$day_of_week <- wday(all_2023$started_at)
# Calculate and store the duration of each ride...
all_2023$ride_length <- difftime(all_2023$ended_at, all_2023$started_at)
# ... and convert that duration to a number
all_2023$ride_length <- as.numeric(all_2023$ride_length)

Next we’ll remove some “bad” data, which consists of bikes being taken out of docks and checked for quality or rides with a negative length. We’ll store this cleaned data in a new frame.

all_2023_v2 <- all_2023[!(all_2023$start_station_name == "HQ QR" | all_2023$ride_length<0),]

Other data considerations

We could consider cutting out some of the extremely long rides, but without knowing the circumstances for them, this would be irresponsible. Given the number of rides we have data for, we hope they will not skew our averages much. Maybe we could repeat this analysis later with rides over 24 hours removed or something similar.
There are also lots of rides missing start and end stations. This could be for a number of reasons we may not be privy to in how Divvy operates, so for now we will take all of the data into account for our next step.

Data Analysis

Now that we have a unified, cleaned, and organized data set, let’s take a look at what information we can pull out of that data.

We’ll start with a quick summary of ride length averages, quantiles, and end points. Remember that these values are in seconds.

summary(all_2023_v2$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     325     572    1091    1015 5909344

Next we’ll break this down by casual users and members, working towards our goal. This is a little hard to read though.

aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = mean)
##   all_2023_v2$member_casual all_2023_v2$ride_length
## 1                    casual               1694.9597
## 2                    member                751.5815
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = median)
##   all_2023_v2$member_casual all_2023_v2$ride_length
## 1                    casual                     711
## 2                    member                     511
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = min)
##   all_2023_v2$member_casual all_2023_v2$ride_length
## 1                    casual                       0
## 2                    member                       0
aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual, FUN = max)
##   all_2023_v2$member_casual all_2023_v2$ride_length
## 1                    casual                 5909344
## 2                    member                   90031

We can also look at the ridership on each day. Note that 1 = Sunday, 2 = Monday, …, 7 = Saturday.

aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual + all_2023_v2$day_of_week, FUN = mean) 
##    all_2023_v2$member_casual all_2023_v2$day_of_week all_2023_v2$ride_length
## 1                     casual                       1               1972.1743
## 2                     member                       1                839.6244
## 3                     casual                       2               1662.8862
## 4                     member                       2                714.1429
## 5                     casual                       3               1505.0795
## 6                     member                       3                720.7687
## 7                     casual                       4               1458.2698
## 8                     member                       4                716.8575
## 9                     casual                       5               1483.5869
## 10                    member                       5                721.1883
## 11                    casual                       6               1635.7334
## 12                    member                       6                748.7234
## 13                    casual                       7               1928.4038
## 14                    member                       7                836.3739

Let’s glance at how ridership varies over the year, broken down by month.

aggregate(all_2023_v2$ride_length ~ all_2023_v2$member_casual + all_2023_v2$month, FUN = mean) 
##    all_2023_v2$member_casual all_2023_v2$month all_2023_v2$ride_length
## 1                     casual                 1               1374.8904
## 2                     member                 1                621.7059
## 3                     casual                 2               1391.5510
## 4                     member                 2                642.8619
## 5                     casual                 3               1283.4632
## 6                     member                 3                626.3867
## 7                     casual                 4               1660.3767
## 8                     member                 4                701.6340
## 9                     casual                 5               1711.2119
## 10                    member                 5                782.4131
## 11                    casual                 6               1764.4163
## 12                    member                 6                792.0162
## 13                    casual                 7               1940.0704
## 14                    member                 7                821.4992
## 15                    casual                 8               2114.8298
## 16                    member                 8                826.3609
## 17                    casual                 9               1511.2266
## 18                    member                 9                788.7250
## 19                    casual                10               1372.3235
## 20                    member                10                729.0645
## 21                    casual                11               1196.1806
## 22                    member                11                694.9790
## 23                    casual                12               1196.2222
## 24                    member                12                686.7966

Summary Table and Visuals

So we can generally see that casual users took longer rides than members every day of the week. Now let’s also look at the number of rides each group took.

all_2023_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarize(number_of_rides=n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual,day_of_week)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 Ă— 4
## # Groups:   member_casual [2]
##    member_casual day_of_week number_of_rides average_duration
##    <chr>               <dbl>           <int>            <dbl>
##  1 casual                  1          335668            1972.
##  2 casual                  2          234818            1663.
##  3 casual                  3          246211            1505.
##  4 casual                  4          249153            1458.
##  5 casual                  5          270596            1484.
##  6 casual                  6          311907            1636.
##  7 casual                  7          410684            1928.
##  8 member                  1          408829             840.
##  9 member                  2          494558             714.
## 10 member                  3          576743             721.
## 11 member                  4          586438             717.
## 12 member                  5          589572             721.
## 13 member                  6          531582             749.
## 14 member                  7          472846             836.
# Let's store that output for later use
summary_2023 <- all_2023_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarize(number_of_rides=n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) 
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

From the resulting table we can see members took a lot more rides every day of the week. Let’s get some visuals for number of rides and average ride duration.

ggplot(data = summary_2023, mapping = aes(x = day_of_week, y = number_of_rides)) + 
  geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
  #geom_label(mapping = aes(label = number_of_rides)) + 
  labs(title = "Number of Rides per Weekday", x = "Day of the Week", y = "Number of Rides", fill = "Rider Type", alt = "A graph with columns showing the number of rides per weekday.  Each weekday has one column for members and another for casual riders." ) +
  annotate("text", x = 1, y = -2000, label = "Sun" ) + 
  annotate("text", x = 2, y = -2000, label = "Mon" ) + 
  annotate("text", x = 3, y = -2000, label = "Tues" ) + 
  annotate("text", x = 4, y = -2000, label = "Wed" ) + 
  annotate("text", x = 5, y = -2000, label = "Thurs" ) + 
  annotate("text", x = 6, y = -2000, label = "Fri" ) + 
  annotate("text", x = 7, y = -2000, label = "Sat" ) 

ggplot(data = summary_2023, mapping = aes(x = day_of_week, y = average_duration)) + 
  geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
  labs(title = "Average length of ride per Weekday", x = "Day of the Week", y = "Average Duration of Ride (in seconds)", fill = "Rider Type", alt = "A graph with columns showing the average duration of a ride per weekday.  Each weekday has one column for members and another for casual riders." ) +
  annotate("text", x = 1, y = -100, label = "Sun" ) + 
  annotate("text", x = 2, y = -100, label = "Mon" ) + 
  annotate("text", x = 3, y = -100, label = "Tues" ) + 
  annotate("text", x = 4, y = -100, label = "Wed" ) + 
  annotate("text", x = 5, y = -100, label = "Thurs" ) + 
  annotate("text", x = 6, y = -100, label = "Fri" ) + 
  annotate("text", x = 7, y = -100, label = "Sat" ) 

When do rides start?

Now let’s take a look at when riders are using the bikes by summarizing based on the hour each ride starts.

starting_hour_summary <- all_2023_v2 %>% 
  group_by(member_casual, day_of_week, start_hour) %>% 
  summarize(number_of_rides=n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week, start_hour) 
## `summarise()` has grouped output by 'member_casual', 'day_of_week'. You can
## override using the `.groups` argument.

With rides starting each hour combined on a single graph, we can see the commute spikes, especially for members.

ggplot(data = starting_hour_summary, mapping = aes(x = start_hour, y = number_of_rides)) + 
  geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
  #geom_label(mapping = aes(label = number_of_rides)) + 
  labs(title = "Number of Rides starting each hour", x = "Hour", y = "Number of Rides", fill = "Rider Type", alt = "A graph with columns showing the number of rides starting each hour of the day.  Each hour has one column for members and another for casual riders." )  

Here is how the number of rides breaks down by starting hour on each day of the week. Remember that day 1 is Sunday.

ggplot(data = starting_hour_summary, mapping = aes(x = start_hour, y = number_of_rides)) + 
  geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
  #geom_label(mapping = aes(label = number_of_rides)) + 
  labs(title = "Number of Rides starting each hour", x = "Hour", y = "Number of Rides", fill = "Rider Type", alt = "A collections of graphs with columns showing the number of starting each hour of the day, one graph for each weekday.  Each hour has one column for members and another for casual riders." )  + 
  facet_wrap(~day_of_week)

We can also look at behaviour of how long rides are based on what hour they started.

ggplot(data = starting_hour_summary, mapping = aes(x = start_hour, y = average_duration)) + 
  geom_col(position = "dodge", mapping = aes(fill = member_casual)) +
  #geom_label(mapping = aes(label = number_of_rides)) + 
  labs(title = "Average duration of rides starting each hour", x = "Hour", y = "Average duration (in seconds)", fill = "Rider Type", alt = "A graph with columns showing the average duraion of rides starting each hour of the day.  Each hour has one column for members and another for casual riders." )  

Honestly this makes me think that casual riders who use a bike to get home after a night out fail to close out their ride properly which skews our average data.

Summary of findings

We see that members take a lot more rides than casual riders (between 2 and 3 times the amount), and casual riders take longer rides on average (which may be skewed by some extremely long outliers, but the median supports this general trend).
When we break it down by day, we can see that on weekends the casual riders take more rides and members take fewer, bringing their numbers almost together, and both are taking longer rides on average than weekdays. And when we break rides down by hour started, we can see the biggest difference is the member rides spiking around commuting hours. There is a definite possibility that casual riders who take bikes home after a late night out do not park them correctly and thus skew the average ride length with outliers that do not reflect how long the bike was actually ridden (rides that start midnight - 4am).

Further Analysis

What bike types did riders use? Are there locations where casual rides tend to start/end?

Ad idea

All the benefits of owning a bike (Show commuting, exercise outside in beautiful weather, meeting friends without needing to park, could segue into downsides of cars - parking, gas, check engine light, pollution…) with none of the hassle (show: bike storage, oiling the chain, bike stolen, mechanic?)