Downsample high-dimensional cytometry data by density-dependent sampling of the cells in each group.

This function downsamples the number of cells in a `tof_tbl` using the density-dependent downsampling algorithm described in Qiu et al. (2011).

Usage

tof_downsample_density(
  tof_tibble,
  group_cols = NULL,
  density_cols = where(tof_is_numeric),
  target_num_cells,
  target_prop_cells,
  target_percentile = 0.03,
  outlier_percentile = 0.01,
  distance_function = c("euclidean", "cosine", "l2", "ip"),
  density_estimation_method = c("mean_distance", "sum_distance", "spade"),
  ...
)

Arguments

tof_tibble: A `tof_tbl` or a `tibble`.
group_cols: Unquoted names of the columns in `tof_tibble` that should be used to define groups within which the downsampling will be performed. Supports tidyselect helpers. Defaults to `NULL` (no grouping).
density_cols: Unquoted names of the columns in `tof_tibble` to use in the density estimation for each cell. Defaults to all numeric columns in `tof_tibble`.
target_num_cells: An approximate constant number of cells (a positive integer) that should be sampled from each group defined by `group_cols`. Slightly more or fewer cells may be returned due to how the density calculation is performed.
target_prop_cells: An approximate proportion of cells (between 0 and 1) that should be sampled from each group defined by `group_cols`. Slightly more or fewer cells may be returned due to how the density calculation is performed. Ignored if `target_num_cells` is specified.
target_percentile: The local density percentile (i.e. a value between 0 and 1) to which the downsampling procedure should adjust all cells. In short, the algorithm will continue to remove cells from the input `tof_tibble` until the local densities of all remaining cells are equal to the density at `target_percentile`. Lower values will result in more cells being removed. See Qiu et al. (2011) for details. Defaults to 0.03 (the 3rd percentile of local densities). Ignored if either `target_num_cells` or `target_prop_cells` is specified.
outlier_percentile: The local density percentile (i.e. a value between 0 and 1) below which cells should be considered outliers (and discarded). Cells with a local density below `outlier_percentile` will never be selected during the downsampling procedure. Defaults to 0.01 (cells below the 1st local density percentile will be removed).
distance_function: A string indicating which distance function to use for the cell-to-cell distance calculations. Options are "euclidean" (the default), "cosine", "l2", and "ip".
density_estimation_method: A string indicating which algorithm should be used to calculate the local density estimate for each cell. Options are k-nearest-neighbor density estimation using the mean distance to a cell's k nearest neighbors ("mean_distance"; the default), k-nearest-neighbor density estimation using the summed distance to a cell's k nearest neighbors ("sum_distance"), and counting the number of neighboring cells within a spherical radius around each cell, as described in Qiu et al. (2011) ("spade"). While "spade" often produces the best results, it is slower than the k-nearest-neighbor methods.
...: Optional additional arguments to pass to `tof_knn_density` or `tof_spade_density`.
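
The interplay of the density estimate, `target_percentile`, and `outlier_percentile` can be sketched in base R. This is a minimal illustration under stated assumptions, not tidytof's implementation: the helper names `knn_mean_distance_density()` and `density_downsample()` are hypothetical, density is assumed to be the inverse of the mean distance to the k nearest neighbors, and the retention rule follows the scheme of Qiu et al. (2011) as understood here (cells below the outlier density are dropped, cells at or below the target density are kept, and denser cells are kept with probability target density / local density).

```r
# Hypothetical helpers for illustration only; not tidytof's implementation.

# Local density estimate (assumed convention): the inverse of the mean
# Euclidean distance from each cell to its k nearest neighbors.
knn_mean_distance_density <- function(x, k = 15) {
  d <- as.matrix(dist(x))  # pairwise Euclidean distances between cells
  diag(d) <- Inf           # a cell is not its own neighbor
  mean_knn <- apply(d, 1, function(row) mean(sort(row)[seq_len(k)]))
  1 / mean_knn
}

# Density-dependent retention: returns a logical vector marking kept cells.
density_downsample <- function(density,
                               target_percentile = 0.03,
                               outlier_percentile = 0.01) {
  outlier_density <- quantile(density, outlier_percentile, names = FALSE)
  target_density <- quantile(density, target_percentile, names = FALSE)

  # Probability of keeping each cell:
  #   0                         if density is below the outlier density
  #   1                         if density is at or below the target density
  #   target_density / density  otherwise (denser cells are thinned more)
  keep_prob <- ifelse(
    density < outlier_density,
    0,
    pmin(1, target_density / density)
  )
  runif(length(density)) < keep_prob
}

# Simulated data: 1,000 cells measured on 3 markers.
set.seed(1)
cells <- matrix(rnorm(1000 * 3), ncol = 3)

dens <- knn_mean_distance_density(cells, k = 15)
keep <- density_downsample(dens, target_percentile = 0.03, outlier_percentile = 0.01)

downsampled <- cells[keep, , drop = FALSE]
nrow(downsampled)
```

Because retention is probabilistic, the number of cells kept varies between runs, which is consistent with the approximate counts noted for `target_num_cells` and `target_prop_cells`.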

Value

A `tof_tbl` with the same number of columns as the input `tof_tibble`, but fewer rows. The number of rows depends on whichever of `target_num_cells`, `target_prop_cells`, or `target_percentile` takes effect; when `target_percentile` is used, lower values select fewer cells.

Examples

# Simulate 1,000 cells measured on four markers
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )

# Keep roughly half of the cells, using SPADE density estimation
tof_downsample_density(
    tof_tibble = sim_data,
    density_cols = c(cd45, cd34, cd38),
    target_prop_cells = 0.5,
    density_estimation_method = "spade"
)
#> # A tibble: 490 × 4
#>       cd45    cd38   cd34    cd19
#>      <dbl>   <dbl>  <dbl>   <dbl>
#>  1 -0.941   2.45   -1.33   0.0428
#>  2 -0.645   0.355  -0.785 -2.05  
#>  3  1.30    0.0177  0.366  1.66  
#>  4 -2.68    1.13   -0.466 -1.96  
#>  5  0.983  -0.351   0.455  0.222 
#>  6 -0.0637 -0.574   0.634  1.01  
#>  7  0.855   0.710  -0.259 -0.846 
#>  8  2.06   -0.704  -0.659  0.559 
#>  9  0.840  -0.249  -1.31  -0.152 
#> 10  1.39    0.571   0.550  0.193 
#> # ℹ 480 more rows

# Keep roughly 200 cells, using SPADE density estimation
tof_downsample_density(
    tof_tibble = sim_data,
    density_cols = c(cd45, cd34, cd38),
    target_num_cells = 200L,
    density_estimation_method = "spade"
)
#> # A tibble: 193 × 4
#>       cd45    cd38    cd34   cd19
#>      <dbl>   <dbl>   <dbl>  <dbl>
#>  1  1.34    0.176   0.0138 -1.02 
#>  2 -0.0637 -0.574   0.634   1.01 
#>  3 -0.997  -1.07   -1.54   -1.19 
#>  4  0.840  -0.249  -1.31   -0.152
#>  5  1.39    0.571   0.550   0.193
#>  6 -0.554   0.483  -0.564   0.727
#>  7 -0.821  -1.77    0.435  -1.43 
#>  8 -0.135  -0.0761 -0.282  -0.808
#>  9  1.46   -0.855   0.179  -1.15 
#> 10 -2.15    0.665  -1.18   -0.795
#> # ℹ 183 more rows

# Keep roughly 200 cells, using k-nearest-neighbor (mean distance) density estimation
tof_downsample_density(
    tof_tibble = sim_data,
    density_cols = c(cd45, cd34, cd38),
    target_num_cells = 200L,
    density_estimation_method = "mean_distance"
)
#> # A tibble: 203 × 4
#>      cd45   cd38     cd34   cd19
#>     <dbl>  <dbl>    <dbl>  <dbl>
#>  1  0.586  1.04  -0.392   -0.946
#>  2  0.855  0.710 -0.259   -0.846
#>  3  0.840 -0.249 -1.31    -0.152
#>  4 -0.821 -1.77   0.435   -1.43 
#>  5  1.46  -0.855  0.179   -1.15 
#>  6  1.13  -1.17  -0.925    1.29 
#>  7 -1.57  -1.32   1.74     0.266
#>  8  0.809 -0.993  0.0404  -0.794
#>  9 -0.771 -0.864 -0.632   -0.416
#> 10 -0.569 -0.209 -0.00985 -0.393
#> # ℹ 193 more rows
