njtierney/naniar

introduce functions to work with spark

Open

#40 opened on January 5, 2017

Labels: Idea, enhancement, help wanted

Description

@MilesMcBain mentioned that it is difficult to work with missing data in Spark.

For example:

library(sparklyr)
library(tibble)
library(dplyr)

dat <- tribble(
    ~A, ~B, ~C,
    NA,  1,  1,
    1,  1, NA,
    NA, NA,  1,
    NA, NA, NA,
    1,  1,  1 
)

sc <- spark_connect(master = "local")
spark_dat <- copy_to(sc, dat)

# A crappy non-scalable way to do complete.cases: every column must be named by hand
complete_cases <- 
    spark_dat %>% 
    filter(!is.na(A) & !is.na(B) & !is.na(C)) %>%
    collect()

# A crappy non-scalable way to find rows with any NA (the negation of the filter above)
any_na <-
    spark_dat %>%
    filter(!(!is.na(A) & !is.na(B) & !is.na(C))) %>%
    collect()

It would be great to have naniar functions that also work with Spark.
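
As a rough sketch of the idea (not an existing naniar or sparklyr API), the filter condition could be built from the column names instead of being typed out per column. The helper names `not_na_all()`, `complete_cases_tbl()` and `any_na_tbl()` are hypothetical. The sketch is shown on a local tibble; it assumes that `colnames()` works on the remote table and that dplyr's SQL translation handles the generated `!is.na(...) & ...` expression the same way it handles the hand-written one above.

```r
library(dplyr)
library(rlang)
library(tibble)

# Hypothetical helper: build the expression
# `!is.na(A) & !is.na(B) & ...` for an arbitrary vector of column names.
not_na_all <- function(cols) {
  stopifnot(length(cols) > 0)
  checks <- lapply(syms(cols), function(col) call2("!", call2("is.na", col)))
  Reduce(function(lhs, rhs) call2("&", lhs, rhs), checks)
}

# Hypothetical helper: keep rows with no missing values in any column.
complete_cases_tbl <- function(tbl) {
  cond <- not_na_all(colnames(tbl))
  filter(tbl, !!cond)
}

# Hypothetical helper: keep rows with at least one missing value.
any_na_tbl <- function(tbl) {
  cond <- call2("!", not_na_all(colnames(tbl)))
  filter(tbl, !!cond)
}

# Same data as in the example above, kept local here.
dat <- tribble(
  ~A, ~B, ~C,
  NA,  1,  1,
   1,  1, NA,
  NA, NA,  1,
  NA, NA, NA,
   1,  1,  1
)

complete_cases_tbl(dat)  # 1 row: the last one
any_na_tbl(dat)          # the other 4 rows
```

If the assumptions hold, `complete_cases_tbl(spark_dat) %>% collect()` would replace the hand-written filter, and the work would still happen inside Spark because only an expression is generated on the R side.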

Not sure how much work this would involve, but it looks like RStudio has a pretty nice extension API.

Just a thought for now. There are a lot of other things that I want to finish up first, but this should be on my roadmap towards version 1.0.0.
