These statistics are based on the data at http://achillesblog.com/atrpt.php copied into a spreadsheet 6/9/14. A cleaner solution would be to use the original data. Initially read most data as strings because fields need different conversions.
Had some trouble with ÿ (character 0xFF, from -1 for EOF?) appearing before days in CSV status fields. Easiest fix was a global replace in the CSV file.
I wish the data included op/non-op so we could break the statistics down by that.
Would also be useful to have rerupture data, but unlikely that would be reliable.
library(lattice) # For densityplot
## Warning: package 'lattice' was built under R version 3.0.3
timeline <- read.csv("C:/Users/rseiter/Documents/Health/Achilles/Statistics/AchillesTimeline.csv",
stringsAsFactors = FALSE)
Convert data to more usable formats. The hardest part is converting days fields into actual days (original data would be easier).
# First save the original data to allow debugging of conversions
timelineOrig <- timeline
timeline$Name <- as.factor(timeline$Name)
timeline$Leg <- as.factor(timeline$Leg)
timeline$Gender <- as.factor(timeline$Gender)
timeline$Date.of.Injury <- as.Date(timeline$Date.of.Injury, format = "%m/%d/%Y")
timeline$Date.of.Surgery <- as.Date(timeline$Date.of.Surgery, format = "%m/%d/%Y")
summary(timeline)
## Name Location Leg Gender
## 123tech : 1 Length:2303 : 7 F: 747
## 19yrold : 1 Class :character B: 31 M:1556
## 1greeneye : 1 Mode :character L:1143
## 1shann : 1 R:1122
## 1shutinathlete: 1
## 1tufchick : 1
## (Other) :2297
## Age.when.Injured Activity.When.Injured Date.of.Injury
## Min. : 1.0 Length:2303 Min. :0208-09-02
## 1st Qu.:31.0 Class :character 1st Qu.:2009-09-01
## Median :38.0 Mode :character Median :2011-05-29
## Mean :37.8 Mean :2009-07-03
## 3rd Qu.:44.0 3rd Qu.:2012-09-19
## Max. :74.0 Max. :2103-07-20
## NA's :426 NA's :98
## Days.to.PWB Days.to.FWB Days.to.2.Shoes
## Length:2303 Length:2303 Length:2303
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Days.to.Phys.Ther Date.of.Surgery
## Length:2303 Min. :2000-06-11
## Class :character 1st Qu.:2009-10-17
## Mode :character Median :2011-07-02
## Mean :2011-05-10
## 3rd Qu.:2012-11-07
## Max. :2014-07-04
## NA's :24
# Year by itself is useful Date.of.Surgery seems like most reliable field to
# use to calculate Year Primarily because it is baseline for status changes
timeline$Year <- as.numeric(format(timeline$Date.of.Surgery, "%Y"))
# Cumbersome but works...
stringToDays <- function(str) {
valid = FALSE
result = 0
days <- suppressWarnings(as.numeric(sub(".*([0-9])\\s*day.*", "\\1", str,
perl = TRUE)))
if (!is.na(days) && days > 0) {
valid = TRUE
result = result + days
}
weeks <- suppressWarnings(as.numeric(sub("[^0-9]*([0-9]+)\\s*wk.*", "\\1",
str, perl = TRUE)))
if (!is.na(weeks) && weeks > 0) {
valid = TRUE
result = result + 7 * weeks
}
if (valid)
result else NA
}
# stringToDays('10 wks 1 day')
# How to vectorize this?
timeline$PWB <- sapply(timeline$Days.to.PWB, stringToDays)
timeline$FWB <- sapply(timeline$Days.to.FWB, stringToDays)
timeline$TwoShoes <- sapply(timeline$Days.to.2.Shoes, stringToDays)
timeline$PT <- sapply(timeline$Days.to.Phys.Ther, stringToDays)
summary(timeline[, c("PWB", "FWB", "TwoShoes", "PT")])
## PWB FWB TwoShoes PT
## Min. : 1.0 Min. : 1.0 Min. : 1 Min. : 1
## 1st Qu.: 18.0 1st Qu.: 32.0 1st Qu.: 49 1st Qu.: 29
## Median : 29.0 Median : 42.0 Median : 61 Median : 44
## Mean : 31.2 Mean : 46.4 Mean : 70 Mean : 51
## 3rd Qu.: 41.0 3rd Qu.: 55.0 3rd Qu.: 76 3rd Qu.: 58
## Max. :368.0 Max. :661.0 Max. :3733 Max. :3345
## NA's :956 NA's :1206 NA's :1419 NA's :1332
Modify data to improve presentation.
For example, surgery dates before 2007 can be removed to give more usable histograms. For now this type of change is done during viewing with only an example here.
cutoff <- as.Date("01/01/2007", format = "%m/%d/%Y")
timelineCopy <- subset(timeline, Date.of.Surgery > cutoff)
hist(timelineCopy$Date.of.Surgery, breaks = "months")
Initial looks at the data. This is used to guide conversion, manipulation, and final statistical work.
hist(timeline$Date.of.Injury, breaks = "months")
# There are problematic injury dates, take a look
dateOrd <- order(timeline$Date.of.Injury)
head(data.frame(string = timelineOrig$Date.of.Injury, date = timeline$Date.of.Injury)[dateOrd,
], 20)
## string date
## 2066 09/02/0208 0208-09-02
## 1259 11/01/1007 1007-11-01
## 1790 07/09/1009 1009-07-09
## 1655 10/1/1972 1972-10-01
## 1677 1/24/1987 1987-01-24
## 1151 6/24/1992 1992-06-24
## 2055 1/1/1993 1993-01-01
## 267 6/15/1994 1994-06-15
## 2279 3/5/1998 1998-03-05
## 147 7/5/1998 1998-07-05
## 65 7/1/2004 2004-07-01
## 675 12/23/2004 2004-12-23
## 1359 5/10/2005 2005-05-10
## 1133 10/1/2005 2005-10-01
## 1999 10/29/2005 2005-10-29
## 2278 4/19/2006 2006-04-19
## 717 6/1/2006 2006-06-01
## 1954 7/4/2006 2006-07-04
## 2277 7/13/2006 2006-07-13
## 2104 1/7/2007 2007-01-07
head(timeline[dateOrd, ], 10)
## Name Location Leg Gender
## 2066 laminateddog Smithtown, New Brunswick, Canada R M
## 1259 cyndide74 L F
## 1790 preet86 Salem R F
## 1655 yeti Canberra, ACT, AUSTRALIA L M
## 1677 littlefeeties St. Paul L F
## 1151 19yrold los angeles R M
## 2055 horsinround Northern WV L F
## 267 upstate2519 Spartanburg, SC R F
## 2279 nicole123 R F
## 147 indybucfan R M
## Age.when.Injured Activity.When.Injured Date.of.Injury
## 2066 1 Taekwon-Do 0208-09-02
## 1259 13 taking levaquin 1007-11-01
## 1790 NA walking drunk with high heels on 1009-07-09
## 1655 12 1972-10-01
## 1677 NA 1987-01-24
## 1151 NA 1992-06-24
## 2055 NA Horseback riding 1993-01-01
## 267 33 hiking/walking 1994-06-15
## 2279 NA 1998-03-05
## 147 36 1998-07-05
## Days.to.PWB Days.to.FWB Days.to.2.Shoes Days.to.Phys.Ther
## 2066 5 wks 5 days 6 wks
## 1259
## 1790
## 1655
## 1677
## 1151 9 wks 8 wks 6 days 9 wks
## 2055 6 wks 6 wks 5 days 7 wks 3 days 8 wks 1 day
## 267 1 wk 5 days 5 wks 8 wks 5 days 5 wks
## 2279
## 147
## Date.of.Surgery Year PWB FWB TwoShoes PT
## 2066 2008-09-10 2008 40 NA NA 42
## 1259 2011-03-31 2011 NA NA NA NA
## 1790 2009-07-09 2009 NA NA NA NA
## 1655 2009-12-23 2009 NA NA NA NA
## 1677 2009-11-13 2009 NA NA NA NA
## 1151 2011-06-27 2011 63 62 63 NA
## 2055 2008-09-24 2008 42 47 52 57
## 267 2013-07-31 2013 12 35 61 35
## 2279 2000-06-11 2000 NA NA NA NA
## 147 2013-12-26 2013 NA NA NA NA
hist(timeline$Date.of.Surgery, breaks = "months")
# Below is probably easier to do with lubridate
# http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
# hist(as.numeric(format(timelineCopy$Date.of.Surgery, '%j'))) # day
hist(as.numeric(format(timelineCopy$Date.of.Surgery, "%W")), breaks = 52, xlab = "Surgery week",
main = "Histogram of week in year of surgery") # week starting Monday
# Below is probably easier to do with lubridate
# http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
hist(as.numeric(format(timeline$Date.of.Injury, "%W")), breaks = 52, xlab = "Injury week",
main = "Histogram of week in year of injury") # week starting Monday
hist(timeline$Date.of.Injury, breaks = "months", xlim = c(as.Date("01/01/2008",
format = "%m/%d/%Y"), Sys.Date()), freq = TRUE, xlab = "", main = "Histogram by Date of Injury")
Look at histograms for each status. For example, days until PWB.
# Limit display of each status
maxPWB <- 17 * 7
maxFWB <- 20 * 7
maxTwoShoes <- 20 * 7
maxPT <- 20 * 7
hist(timeline$PWB[timeline$PWB < maxPWB], breaks = seq(0, maxPWB, 7), xlab = "Days",
main = "Histogram of Days Until PWB")
densityplot(timeline$PWB[timeline$PWB < maxPWB], xlab = "Days", main = "Density Plot of Days Until PWB")
hist(timeline$FWB[timeline$FWB < maxFWB], breaks = seq(0, maxFWB, 7), xlab = "Days",
main = "Histogram of Days Until FWB")
densityplot(timeline$FWB[timeline$FWB < maxFWB], xlab = "Days", main = "Density Plot of Days Until FWB")
hist(timeline$TwoShoes[timeline$TwoShoes < maxTwoShoes], breaks = seq(0, maxTwoShoes,
7), xlab = "Days", main = "Histogram of Days Until 2 Shoes")
densityplot(timeline$TwoShoes[timeline$TwoShoes < maxTwoShoes], xlab = "Days",
main = "Density Plot of Days Until 2 Shoes")
hist(timeline$PT[timeline$PT < maxPT], breaks = seq(0, maxPT, 7), xlab = "Days",
main = "Histogram of Days Until Physical Therapy")
densityplot(timeline$PT[timeline$PT < maxPT], xlab = "Days", main = "Density Plot of Days Until Physical Therapy")
summary(timeline$Age.when.Injured)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 31.0 38.0 37.8 44.0 74.0 426
hist(timeline$Age.when.Injured)
densityplot(timeline$Age.when.Injured)
With the advent of more rapid non-op protocols I suspect the time to milestones (especially PWB?) will have changed over the years. See if this is true.
See Exploratory Data Analysis week 2 for plotting examples.
# Limit display
maxPWB <- 17 * 7
maxFWB <- 20 * 7
maxTwoShoes <- 20 * 7
maxPT <- 20 * 7
# densityplot(~ PWB | Year, data=timeline[timeline$PWB < maxPWB &
# timeline$Year >= 2007,], auto.key=TRUE, # Does not help?
# layout=c(1,max(timeline$Year,na.rm=TRUE)-2006), xlab='Days', main='Density
# Plots of Days Until PWB by Year') bwplot(~ PWB | Year,
# data=timeline[timeline$PWB < maxPWB & timeline$Year >= 2007,],
# auto.key=TRUE, # Does not help?
# #layout=c(1,max(timeline$Year,na.rm=TRUE)-2006), xlab='Days',
# main='Density Plots of Days Until PWB by Year') # Alternatively, try
# ggplot faceted histogram library(ggplot2) qplot(PWB,
# data=timeline[timeline$PWB < maxPWB & timeline$Year >= 2007 &
# !is.na(timeline$PWB),], facets = Year ~ ., binwidth=7)
boxplot(PWB ~ Year, data = timeline, subset = timeline$PWB < maxPWB & timeline$Year >=
2007, xlab = "", ylab = "Days Until PWB", main = "Box Plots of Days Until PWB by Year")
boxplot(FWB ~ Year, data = timeline, subset = timeline$FWB < maxFWB & timeline$Year >=
2007, xlab = "", ylab = "Days Until PWB", main = "Box Plots of Days Until FWB by Year")
boxplot(TwoShoes ~ Year, data = timeline, subset = timeline$TwoShoes < maxTwoShoes &
timeline$Year >= 2007, xlab = "", ylab = "Days Until Two Shoes", main = "Box Plots of Days Until Two Shoes by Year")
boxplot(PT ~ Year, data = timeline, subset = timeline$PT < maxPT & timeline$Year >=
2007, xlab = "", ylab = "Days Until PT", main = "Box Plots of Days Until Physical Therapy by Year")
# Summary statistics by year
timelineSubset <- subset(timeline, timeline$Year >= 2007)
tapply(timelineSubset$PWB, timelineSubset$Year, summary)
## $`2008`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 22.0 32.0 33.6 42.0 87.0 104
##
## $`2009`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 20.0 31.0 32.0 41.5 89.0 113
##
## $`2010`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.0 21.0 31.0 33.6 42.0 102.0 139
##
## $`2011`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 16.0 27.0 29.8 39.0 114.0 147
##
## $`2012`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 17.5 28.0 31.9 41.5 368.0 180
##
## $`2013`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.0 16.0 27.0 28.9 38.0 91.0 172
##
## $`2014`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 15.0 21.0 25.1 29.0 181.0 75
tapply(timelineSubset$FWB, timelineSubset$Year, summary)
## $`2008`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.0 37.0 47.0 49.8 58.5 121.0 122
##
## $`2009`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 34 43 46 56 101 144
##
## $`2010`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.0 32.0 42.0 51.1 57.2 661.0 169
##
## $`2011`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.0 27.0 39.0 42.9 50.0 424.0 180
##
## $`2012`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 35.0 43.0 48.4 55.0 414.0 237
##
## $`2013`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.0 32.0 41.0 43.4 54.0 121.0 224
##
## $`2014`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.0 27.0 35.0 36.2 42.0 92.0 105
tapply(timelineSubset$TwoShoes, timelineSubset$Year, summary)
## $`2008`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.0 51.8 61.5 65.9 80.0 137.0 141
##
## $`2009`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 49.0 60.0 66.1 77.0 434.0 165
##
## $`2010`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 16 48 57 63 73 235 194
##
## $`2011`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.0 47.0 57.0 62.3 71.0 284.0 238
##
## $`2012`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29 52 61 101 76 3730 279
##
## $`2013`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.0 54.0 63.0 65.6 77.0 114.0 259
##
## $`2014`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.0 49.0 58.0 61.7 71.0 122.0 117
tapply(timelineSubset$PT, timelineSubset$Year, summary)
## $`2008`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.0 34.0 50.0 50.7 63.0 193.0 148
##
## $`2009`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 32.8 44.5 48.6 60.2 420.0 160
##
## $`2010`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2 32 48 72 58 3340 185
##
## $`2011`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8.0 29.0 42.0 46.1 54.0 412.0 213
##
## $`2012`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 30.8 46.5 50.6 60.2 390.0 255
##
## $`2013`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.0 26.0 43.0 47.3 58.0 413.0 250
##
## $`2014`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.0 18.8 31.0 34.9 44.5 95.0 96
Time until PWB, FWB, and PT all show a decreasing trend by year, but time until Two Shoes is relatively constant.