Labels: Idea, enhancement, help wanted
Description
@MilesMcBain mentioned that it is difficult to work with missing data in Spark.
For example:
library(sparklyr)
library(tibble)
library(dplyr)
dat <- tribble(
  ~A, ~B, ~C,
  NA,  1,  1,
   1,  1, NA,
  NA, NA,  1,
  NA, NA, NA,
   1,  1,  1
)
sc <- spark_connect(master = "local")
spark_dat <- copy_to(sc, dat)
# A crappy, non-scalable way to do complete.cases():
# every column has to be named by hand
complete_cases <-
  spark_dat %>%
  filter(!is.na(A) & !is.na(B) & !is.na(C)) %>%
  collect()
# A crappy, non-scalable way to find rows with any NA
any_na <-
  spark_dat %>%
  filter(!(!is.na(A) & !is.na(B) & !is.na(C))) %>%
  collect()
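One possible way around naming every column by hand is to build the filter expression from the column names. This is only a sketch: `any_na_expr()`, `filter_complete()` and `filter_any_na()` are hypothetical helper names, not naniar or sparklyr functions, and it is run on a local tibble here. I am assuming (untested) that the same helpers would work on `spark_dat`, since the expression they generate is the same kind of `is.na()` filter written by hand above.

```r
library(dplyr)
library(rlang)
library(tibble)

dat <- tribble(
  ~A, ~B, ~C,
  NA,  1,  1,
   1,  1, NA,
  NA, NA,  1,
  NA, NA, NA,
   1,  1,  1
)

# Hypothetical helper: builds is.na(A) | is.na(B) | ... over all columns of `tbl`.
# Assumes `tbl` has at least one column.
any_na_expr <- function(tbl) {
  checks <- lapply(syms(colnames(tbl)), function(col) call2("is.na", col))
  Reduce(function(lhs, rhs) call2("|", lhs, rhs), checks)
}

# Hypothetical helper: keep rows with at least one missing value.
filter_any_na <- function(tbl) {
  filter(tbl, !!any_na_expr(tbl))
}

# Hypothetical helper: keep rows with no missing values (like complete.cases()).
filter_complete <- function(tbl) {
  filter(tbl, !!call2("!", any_na_expr(tbl)))
}

complete_cases <- filter_complete(dat)
any_na <- filter_any_na(dat)
```

On the example data, `complete_cases` should hold only the last row and `any_na` the other four. For a Spark table, a `collect()` would still be needed at the end to bring the result back into R.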
It would be great to have naniar functions that also work with Spark.
Not sure how much work this would involve, but it looks like RStudio has a pretty nice extension API for sparklyr.
Just a thought for now. There are a lot of other things that I want to finish up first, but this should be on my roadmap towards version 1.0.0.