introduce functions to work with spark · njtierney/naniar#40

3 comments · 0 reactions · 0 assignees · Language: R · 557 stars · 43 forks

Labels: Idea, enhancement, help wanted

Description

@MilesMcBain mentioned that it is difficult to work with missing data within Spark.

For example:

library(sparklyr)
library(tibble)
library(dplyr)

dat <- tribble(
    ~A, ~B, ~C,
    NA,  1,  1,
    1,  1, NA,
    NA, NA,  1,
    NA, NA, NA,
    1,  1,  1 
)

sc <- spark_connect(master = "local")
spark_dat <- copy_to(sc, dat)

# A crappy, non-scalable way to do complete.cases
complete_cases <- 
    spark_dat %>% 
    filter(!is.na(A) & !is.na(B) & !is.na(C)) %>%
    collect()

# A crappy, non-scalable way to find rows with any NA
any_na <-
    spark_dat %>%
    filter(!(!is.na(A) & !is.na(B) & !is.na(C))) %>%
    collect()

It would be great to have naniar functions that also work with Spark.
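As a rough sketch of what such functions might look like, the filter condition can be built from the column names instead of being typed out by hand. The helper names below (`all_present_expr`, `filter_complete_cases`, `filter_any_na`) are hypothetical and not part of naniar. The example runs on a local tibble; it assumes, without having been tested here, that the same `filter()` call would be translated to SQL when given a `tbl_spark`.

```r
library(dplyr)
library(rlang)

# Hypothetical helper (not part of naniar): build the expression
# `!is.na(A) & !is.na(B) & ...` over every column of `data`.
all_present_expr <- function(data) {
  checks <- lapply(syms(colnames(data)), function(col) expr(!is.na(!!col)))
  Reduce(function(lhs, rhs) call2("&", lhs, rhs), checks)
}

# Hypothetical helper: keep rows with no missing values.
filter_complete_cases <- function(data) {
  cond <- all_present_expr(data)
  filter(data, !!cond)
}

# Hypothetical helper: keep rows with at least one missing value.
filter_any_na <- function(data) {
  cond <- all_present_expr(data)
  filter(data, !(!!cond))
}

dat <- tribble(
  ~A, ~B, ~C,
  NA,  1,  1,
   1,  1, NA,
  NA, NA,  1,
  NA, NA, NA,
   1,  1,  1
)

# Local data frame here; with Spark one would pass `spark_dat` and
# call collect() afterwards (assumption, untested in this sketch).
complete_local <- filter_complete_cases(dat)
any_na_local   <- filter_any_na(dat)
```

Because the condition only uses `is.na()`, `!` and `&`, the filtering should stay on the backend rather than pulling data into R first, and it does not depend on how many columns the table has.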

Not sure how much work this would involve, but it looks like RStudio has a pretty nice extension API.

Just a thought for now. There are a lot of other things that I want to finish up first, but this should be on my roadmap towards version 1.0.0.

Contributor guide

Tech stack: none
Area: data
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: stale
Clarity: mostly clear
Prerequisites: R programming, familiarity with naniar, knowledge of missing data handling, sparklyr basics
Beginner friendliness: 25
Investigation approach: The issue requests naniar functions that are compatible with Spark, as discussed in the issue body with example code using sparklyr. To begin, review the sparklyr extension API at http://spark.rstudio.com/extensions.html. Check the naniar repository for existing infrastructure and consider how functions like 'any na' or 'complete cases' could be implemented for Spark data frames. Look for any related comments in the issue or linked discussions that might provide further direction.