r - dplyr and checking previous 24-60 months for observations -
r - dplyr and checking previous 24-60 months for observations -
i have bunch of unique cusip codes(unique id) , need check create sure there 24-60 previous month observations, not sure how check dplyr
tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975" ), datadate = c(19711231l, 19710129l, 19710226l, 19710331l, 19710430l, 19710528l, 19710630l, 19710730l, 19710831l, 19710930l, 19711029l, 19711130l, 19721231l, 19720131l, 19720229l, 19720330l, 19720428l, 19720531l, 19720630l, 19720731l, 19720831l, 19720929l, 19721031l, 19721130l, 19721229l, 19731231l, 19730131l, 19730228l, 19730330l, 19730430l, 19730531l, 19730629l, 19730731l, 19730831l, 19730928l, 19731031l, 19731130l, 19741231l, 19740131l, 19740228l, 19740329l, 19740430l, 19740531l, 19740628l, 19740731l, 19740830l, 19740930l, 19741031l, 19741129l, 19751231l, 19750131l, 19750228l, 19750331l, 19750430l, 19750530l, 19750630l, 19750731l, 19750829l, 19750930l, 19751031l)), .names = c("cusip", "fyear", "datadate"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(na, -60l), vars = list( cusip, fyear), drop = true, indices = list(0:11, 12:24, 25:36, 37:48, 49:59), group_sizes = c(12l, 13l, 12l, 12l, 11l), biggest_group_size = 13l, labels = structure(list( cusip = c(2, 2, 2, 2, 2), fyear = c("1971", "1972", "1973", "1974", "1975")), class = "data.frame", row.names = c(na, -5l), .names = c("cusip", "fyear"), vars = list(cusip, fyear)))
logic i thinking checking total months each year, don't know know extract previous months check if 24/60 >= 0.4. how edited code check previous 60 months , ensure there @ to the lowest degree 24 months including....
tdata %>% group_by(cusip, fyear) %>% mutate(month = substr(datadate, 5, 6) %>% mutate(pre_countmonths = length(unique(month))
edit 04/07/2015 : here logic next loops. 1 of challenges i'm having r branching outside of loops. possible way edit dplyr
can utilized instead of for
loops? take exclusively long run current data.
for(i in min(tdata$cusip):max(tdata$cusip)){ (j in min(tdata$fyear):max(tdata$fyear) { monthcheck <- filter(tdata, cusip == & (fyear == j-1 | fyear == j-2 | fyear == j-3 | fyear == j-4)) if(length(monthcheck$month) / 40 >= 0.4) if(any(tdata$fyear == j)) tdata$check <- 1 }}
edit : 04/08/2015 - added total sample dataset main variables small subset : https://www.dropbox.com/s/mf0o0tbgbame6k8/testdata.csv?dl=0
this got within time limit. hope gives ideas, , other users provide improve solutions.
mydf <- as_data_frame(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975" ), datadate = c(19711231l, 19710129l, 19710226l, 19710331l, 19710430l, 19710528l, 19710630l, 19710730l, 19710831l, 19710930l, 19711029l, 19711130l, 19721231l, 19720131l, 19720229l, 19720330l, 19720428l, 19720531l, 19720630l, 19720731l, 19720831l, 19720929l, 19721031l, 19721130l, 19721229l, 19731231l, 19730131l, 19730228l, 19730330l, 19730430l, 19730531l, 19730629l, 19730731l, 19730831l, 19730928l, 19731031l, 19731130l, 19741231l, 19740131l, 19740228l, 19740329l, 19740430l, 19740531l, 19740628l, 19740731l, 19740830l, 19740930l, 19741031l, 19741129l, 19751231l, 19750131l, 19750228l, 19750331l, 19750430l, 19750530l, 19750630l, 19750731l, 19750829l, 19750930l, 19751031l))) # create normal data.frame mydf <- data.frame(mydf) # create info frame new cusip mydf2 <- mutate(mydf, cusip = 3) ### create new info frame missing 1 info point foo <- bind_rows(mydf, mydf2[-4, ])
in pseudo data, cusip 3 missing 1 month data. means, not have consecutive 24-60-month info cusip
3. first, created column month , column date object. then, ordered info cusp
, datadate. wanted select info points remain between 24-60 month period. first filter
part. grouped info cusp
. using month, wanted check if have consecutive info points or not. expect lead(month)-month
= 1, 11, or 0. if have 2 info points same month, expect 0. happening in data. final filter
can revise. here, wanted remove cusip
got false in check. in draft, filter seems doing right thing; not see info of cusip 3 in end. hope helps you.
mutate(foo, month = as.numeric(substr(datadate, 5, 6))) %>% mutate(datadate = as.posixct(gsub("^(\\d{4})(\\d{2}).*$", "\\1-\\2-01", datadate), format("%y-%m-%d"), tz = "gmt")) %>% arrange(cusip, datadate) %>% filter(between(datadate, datadate[tail(which(month == 6, arr.ind = true), n = 1)] - (60*60*24*30*60), datadate[tail(which(month == 6, arr.ind = true), n = 1)] -(60*60*24*30*24))) %>% group_by(cusip) %>% mutate(check = abs(lead(month)-month) == 11|abs(lead(month)-month) == 1|abs(lead(month)-month) == 0) %>% filter(all(check == true | check %in% na))
r dplyr
Comments
Post a Comment