Downsample high-dimensional cytometry data using density-dependent downsampling within each group.
Source: R/downsampling.R
tof_downsample_density.Rd
This function downsamples the number of cells in a `tof_tbl` using the density-dependent downsampling algorithm described in Qiu et al. (2011).
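The selection rule behind density-dependent downsampling can be sketched in base R. This is a minimal illustration of the rule as described in this page and in Qiu et al. (2011), not the code tidytof runs: the function name `density_downsample_sketch` and the toy `toy_density` vector are hypothetical, and the local densities are assumed to have been estimated already.

```r
# Sketch of a SPADE-style density-dependent selection rule (not tidytof's code).
# `density` is a hypothetical numeric vector of per-cell local density estimates.
density_downsample_sketch <- function(density,
                                      target_percentile = 0.03,
                                      outlier_percentile = 0.01) {
  # Density values at the outlier and target percentiles.
  outlier_density <- stats::quantile(density, probs = outlier_percentile, names = FALSE)
  target_density <- stats::quantile(density, probs = target_percentile, names = FALSE)

  # Keep probability: 1 for cells at or below the target density,
  # target_density / density for cells in denser regions.
  keep_prob <- pmin(1, target_density / density)

  # Cells below the outlier density are never selected.
  keep_prob[density < outlier_density] <- 0

  # One random draw per cell, so the number kept varies from run to run.
  which(stats::runif(length(density)) < keep_prob)
}

set.seed(1)
toy_density <- stats::rexp(1000, rate = 1) + 0.01  # made-up densities
kept <- density_downsample_sketch(toy_density)
length(kept)  # number of cells retained in this draw
```

Because each cell is kept by an independent random draw, dense regions are thinned heavily while sparse regions are mostly preserved, and the retained count is only approximately predictable.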
Usage
tof_downsample_density(
  tof_tibble,
  group_cols = NULL,
  density_cols = where(tof_is_numeric),
  target_num_cells,
  target_prop_cells,
  target_percentile = 0.03,
  outlier_percentile = 0.01,
  distance_function = c("euclidean", "cosine", "l2", "ip"),
  density_estimation_method = c("mean_distance", "sum_distance", "spade"),
  ...
)
Arguments
- tof_tibble
A `tof_tbl` or a `tibble`.
- group_cols
Unquoted names of the columns in `tof_tibble` that should be used to define groups within which the downsampling will be performed. Supports tidyselect helpers. Defaults to `NULL` (no grouping).
- density_cols
Unquoted names of the columns in `tof_tibble` to use in the density estimation for each cell. Defaults to all numeric columns in `tof_tibble`.
- target_num_cells
An approximate constant number of cells (a positive integer, e.g. `200L`) that should be sampled from each group defined by `group_cols`. Slightly more or fewer cells may be returned due to how the density calculation is performed.
- target_prop_cells
An approximate proportion of cells (between 0 and 1) that should be sampled from each group defined by `group_cols`. Slightly more or fewer cells may be returned due to how the density calculation is performed. Ignored if `target_num_cells` is specified.
- target_percentile
The local density percentile (i.e. a value between 0 and 1) to which the downsampling procedure should adjust all cells. In short, the algorithm will continue to remove cells from the input `tof_tibble` until the local densities of all remaining cells are equal to the density at `target_percentile`. Lower values will result in more cells being removed. See Qiu et al. (2011) for details. Defaults to 0.03 (the 3rd percentile of local densities), as shown in Usage. Ignored if either `target_num_cells` or `target_prop_cells` is specified.
- outlier_percentile
The local density percentile (i.e. a value between 0 and 1) below which cells should be considered outliers (and discarded). Cells with a local density below `outlier_percentile` will never be selected during the downsampling procedure. Defaults to 0.01 (cells below the 1st local density percentile will be removed).
- distance_function
A string indicating which distance function to use for the cell-to-cell distance calculations. Options include "euclidean" (the default) and "cosine" distances. The usage signature also lists "l2" and "ip".
- density_estimation_method
A string indicating which algorithm should be used to calculate the local density estimate for each cell. Options include k-nearest-neighbor density estimation using the mean distance to a cell's k nearest neighbors ("mean_distance"; the default), k-nearest-neighbor density estimation using the summed distance to a cell's k nearest neighbors ("sum_distance"), and counting the number of neighboring cells within a spherical radius around each cell as described in Qiu et al. (2011) ("spade"). While "spade" often produces the best results, it is slower than the k-nearest-neighbor density estimation methods.
- ...
Optional additional arguments to pass to `tof_knn_density` or `tof_spade_density`.
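To make the three `density_estimation_method` options concrete, here is a rough base R sketch on a small matrix. It is an illustration under stated assumptions, not tidytof's implementation: the function `local_density_sketch`, its `k` and `radius` parameters, and the choice to turn neighbor distances into a density by taking the reciprocal are all hypothetical. It uses Euclidean distance only and computes all pairwise distances, which is practical only for small inputs.

```r
# Sketch of three local density estimates (base R only; not tidytof's code).
local_density_sketch <- function(x,
                                 method = c("mean_distance", "sum_distance", "spade"),
                                 k = 15,
                                 radius = 1) {
  method <- match.arg(method)
  d <- as.matrix(stats::dist(x, method = "euclidean"))  # all pairwise distances
  diag(d) <- NA  # ignore each cell's distance to itself

  if (method == "spade") {
    # Count the neighbors that fall inside a fixed radius around each cell.
    return(unname(rowSums(d < radius, na.rm = TRUE)))
  }

  # Distances to the k nearest neighbors of each cell (one column per cell).
  # sort() drops the NA on the diagonal by default.
  knn_dist <- apply(d, 1, function(row) sort(row)[seq_len(k)])
  knn_dist <- matrix(knn_dist, nrow = k)

  summary_dist <- if (method == "mean_distance") {
    colMeans(knn_dist)
  } else {
    colSums(knn_dist)
  }

  # Assumption: closer neighbors mean higher density, so take the reciprocal.
  1 / summary_dist
}

set.seed(1)
cells <- matrix(stats::rnorm(200 * 3), ncol = 3)  # 200 toy cells, 3 markers
dens_mean <- local_density_sketch(cells, "mean_distance", k = 10)
dens_sum <- local_density_sketch(cells, "sum_distance", k = 10)
dens_spade <- local_density_sketch(cells, "spade", radius = 1)
```

In this sketch the mean-distance and sum-distance estimates differ only by the constant factor `k`, so they rank cells identically; the radius count can rank them differently because it depends on the chosen `radius`.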
Value
A `tof_tbl` with the same number of columns as the input `tof_tibble`, but fewer rows. The number of rows depends on `target_num_cells` or `target_prop_cells` when either is specified; otherwise it depends on `target_percentile`, with lower values selecting fewer cells.
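One way to see why the returned row count is only approximate when a cell count is requested: a target density can be chosen so that the expected number of retained cells matches the request, while the realized number still depends on random draws. The sketch below shows that idea in base R. It is assumed logic for illustration, not taken from tidytof; `target_density_for_count` and `toy_density` are hypothetical names, and outlier removal is left out for brevity.

```r
# Sketch: choose a target density so the expected number of kept cells
# matches a requested count (assumed logic; not tidytof's code).
target_density_for_count <- function(density, target_num_cells) {
  # Expected number of kept cells when each cell is kept with
  # probability min(1, target_density / density).
  expected_kept <- function(target_density) {
    sum(pmin(1, target_density / density))
  }

  # expected_kept() increases with target_density, so a root search works
  # as long as target_num_cells is below the total number of cells.
  stats::uniroot(
    function(td) expected_kept(td) - target_num_cells,
    interval = c(min(density) * 1e-6, max(density)),
    tol = 1e-9
  )$root
}

set.seed(1)
toy_density <- stats::rexp(1000, rate = 1) + 0.01  # made-up densities
td <- target_density_for_count(toy_density, target_num_cells = 200)

keep_prob <- pmin(1, td / toy_density)
expected_count <- sum(keep_prob)  # close to the requested 200
realized_count <- sum(stats::runif(length(keep_prob)) < keep_prob)  # varies by draw
```

The expected count sits at the requested value, but the realized count scatters around it, which is consistent with the example outputs on this page returning a few cells more or fewer than requested.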
See also
Other downsampling functions: tof_downsample(), tof_downsample_constant(), tof_downsample_prop()
Examples
sim_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 1000),
    cd38 = rnorm(n = 1000),
    cd34 = rnorm(n = 1000),
    cd19 = rnorm(n = 1000)
  )
tof_downsample_density(
  tof_tibble = sim_data,
  density_cols = c(cd45, cd34, cd38),
  target_prop_cells = 0.5,
  density_estimation_method = "spade"
)
#> # A tibble: 490 × 4
#> cd45 cd38 cd34 cd19
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.941 2.45 -1.33 0.0428
#> 2 -0.645 0.355 -0.785 -2.05
#> 3 1.30 0.0177 0.366 1.66
#> 4 -2.68 1.13 -0.466 -1.96
#> 5 0.983 -0.351 0.455 0.222
#> 6 -0.0637 -0.574 0.634 1.01
#> 7 0.855 0.710 -0.259 -0.846
#> 8 2.06 -0.704 -0.659 0.559
#> 9 0.840 -0.249 -1.31 -0.152
#> 10 1.39 0.571 0.550 0.193
#> # ℹ 480 more rows
tof_downsample_density(
  tof_tibble = sim_data,
  density_cols = c(cd45, cd34, cd38),
  target_num_cells = 200L,
  density_estimation_method = "spade"
)
#> # A tibble: 193 × 4
#> cd45 cd38 cd34 cd19
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1.34 0.176 0.0138 -1.02
#> 2 -0.0637 -0.574 0.634 1.01
#> 3 -0.997 -1.07 -1.54 -1.19
#> 4 0.840 -0.249 -1.31 -0.152
#> 5 1.39 0.571 0.550 0.193
#> 6 -0.554 0.483 -0.564 0.727
#> 7 -0.821 -1.77 0.435 -1.43
#> 8 -0.135 -0.0761 -0.282 -0.808
#> 9 1.46 -0.855 0.179 -1.15
#> 10 -2.15 0.665 -1.18 -0.795
#> # ℹ 183 more rows
tof_downsample_density(
  tof_tibble = sim_data,
  density_cols = c(cd45, cd34, cd38),
  target_num_cells = 200L,
  density_estimation_method = "mean_distance"
)
#> # A tibble: 203 × 4
#> cd45 cd38 cd34 cd19
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.586 1.04 -0.392 -0.946
#> 2 0.855 0.710 -0.259 -0.846
#> 3 0.840 -0.249 -1.31 -0.152
#> 4 -0.821 -1.77 0.435 -1.43
#> 5 1.46 -0.855 0.179 -1.15
#> 6 1.13 -1.17 -0.925 1.29
#> 7 -1.57 -1.32 1.74 0.266
#> 8 0.809 -0.993 0.0404 -0.794
#> 9 -0.771 -0.864 -0.632 -0.416
#> 10 -0.569 -0.209 -0.00985 -0.393
#> # ℹ 193 more rows