Reading and writing data
Timothy Keyes
2024-08-25
This vignette teaches you how to read CyTOF data into an R session from two common file formats in which CyTOF data is typically stored: Flow Cytometry Standard (FCS) and Comma-Separated Value (CSV) files.
Accessing the data for this vignette
tidytof comes bundled with several example mass cytometry datasets. To access the raw FCS and CSV files containing these data, use the tidytof_example_data function. When called with no arguments, tidytof_example_data returns a character vector naming the datasets contained in tidytof:
# load the packages used throughout this vignette
library(tidytof)
library(dplyr)

tidytof_example_data()
#> [1] "aml" "ddpr" "ddpr_metadata.csv"
#> [4] "mix" "mix2" "phenograph"
#> [7] "phenograph_csv" "surgery"
The details of the datasets contained in each of these directories aren’t particularly important, but some basic information is as follows:
- aml - one FCS file containing myeloid cells from a healthy bone marrow and one FCS file containing myeloid cells from an AML patient bone marrow
- ddpr - two FCS files containing B-cell lineage cells from this paper
- mix - two FCS files with different CyTOF antigen panels (one FCS file from the “aml” directory and one from the “phenograph” directory)
- mix2 - three files with different CyTOF antigen panels and different file extensions (one FCS file from the “aml” directory and two CSV files from the “phenograph_csv” directory)
- phenograph - three FCS files containing AML cells from this paper
- phenograph_csv - the same cells as in the “phenograph” directory, but stored in CSV files
- scaffold - three FCS files from this paper
- statistical_scaffold - three FCS files from this paper
- surgery - three FCS files from this paper
To obtain the file path for the directory containing each dataset, call tidytof_example_data with one of these dataset names as its argument. For example, to obtain the directory for the phenograph data, we would use the following command:
tidytof_example_data("phenograph")
#> [1] "/home/runner/work/_temp/Library/tidytof/extdata/phenograph"
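To see which raw files a dataset directory actually holds, the returned path can be passed to base R's list.files. This is a minimal sketch; the variable names phenograph_dir and phenograph_files are illustrative and not part of tidytof.

```r
library(tidytof)

# Directory holding the bundled "phenograph" example files
phenograph_dir <- tidytof_example_data("phenograph")

# Base R: collect the full paths of the files in that directory
phenograph_files <- list.files(phenograph_dir, full.names = TRUE)

# Show only the file names, without the leading directory
basename(phenograph_files)
```

The full paths in phenograph_files can be handed to any function that expects a file path.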
Reading Data with tof_read_data
Using one of these directories (or any other directory containing CyTOF data on your local machine), we can use tof_read_data to read CyTOF data from raw files. Acceptable formats include FCS files and CSV files. Importantly, tof_read_data reads either a single FCS/CSV file or multiple FCS/CSV files, depending on whether its first argument (path) leads to a single file or to a directory of files.
Here, we can use tof_read_data to read in all of the FCS files in the “phenograph” example dataset bundled into tidytof and store them in the phenograph variable.
phenograph <-
    tidytof_example_data("phenograph") %>%
    tof_read_data()

phenograph %>%
    class()
#> [1] "tof_tbl" "tbl_df" "tbl" "data.frame"
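The single-file case described above can be sketched the same way. The snippet below picks the first file in the “phenograph” directory (an arbitrary choice for illustration) and compares the number of cells read from that one file with the number read from the whole directory. The names fcs_files, one_file, and all_files are hypothetical.

```r
library(tidytof)

# Full paths of the files in the bundled "phenograph" directory
phenograph_dir <- tidytof_example_data("phenograph")
fcs_files <- list.files(phenograph_dir, full.names = TRUE)

# A path to a single file reads only that file
one_file <- tof_read_data(fcs_files[[1]])

# A path to a directory reads every file it contains
all_files <- tof_read_data(phenograph_dir)

# Compare the number of cells (rows) read in each case
c(single_file = nrow(one_file), directory = nrow(all_files))
```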
Regardless of the input data file type, tidytof reads data into an extended tibble class called a tof_tbl (pronounced “tof tibble”).
tof tibbles are an S3 class that extends tbl_df with one additional attribute (“panel”). tidytof stores this additional attribute in tof_tbls because, in addition to analyzing CyTOF data from individual experiments, CyTOF users often want to compare panels between experiments to find common markers or to compare which metals are associated with particular markers across panels. To retrieve this panel information from a tof_tbl, use tof_get_panel:
phenograph %>%
    tof_get_panel()
#> # A tibble: 44 × 2
#> metals antigens
#> <chr> <chr>
#> 1 Time Time
#> 2 Cell_length Cell_length
#> 3 Ir191 DNA1
#> 4 Ir193 DNA2
#> # ℹ 40 more rows
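As a rough sketch of the panel comparison mentioned above, the panels of two bundled datasets can be retrieved and their antigen names intersected. This assumes both panels expose the “metals” and “antigens” columns shown in the output above. The names aml_panel, phenograph_panel, and shared_antigens are illustrative.

```r
library(tidytof)
library(dplyr)

# Read two bundled datasets and extract their panels
aml_panel <-
    tidytof_example_data("aml") %>%
    tof_read_data() %>%
    tof_get_panel()

phenograph_panel <-
    tidytof_example_data("phenograph") %>%
    tof_read_data() %>%
    tof_get_panel()

# Antigen names that appear in both panels
shared_antigens <- intersect(aml_panel$antigens, phenograph_panel$antigens)

# Metals paired with each shared antigen in the phenograph panel
phenograph_panel %>%
    filter(antigens %in% shared_antigens)
```

Running the same filter on aml_panel shows whether a shared antigen is paired with the same metal in both experiments.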
A few additional notes about tof_tbls:
- tof_tbls contain one cell per row and one CyTOF channel per column (to provide the data in its “tidy” format).
- tof_read_data adds an additional column to the output tof_tbl encoding the name of the file from which each cell was read (the “file_name” column).
- Because tof_tbls inherit from the tbl_df class, all methods available to tibbles are also available to tof_tbls.
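For instance, because each row is a cell and the “file_name” column records its source file, a standard dplyr verb is enough to tally cells per file. This is a small sketch using dplyr::count; nothing here is specific to tidytof beyond reading the data.

```r
library(tidytof)
library(dplyr)

# Read every file in the bundled "phenograph" directory
phenograph <-
    tidytof_example_data("phenograph") %>%
    tof_read_data()

# One row per source file, with the number of cells read from it
cells_per_file <-
    phenograph %>%
    count(file_name, name = "n_cells")

cells_per_file
```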
Using tibble methods with {tidytof} tibbles
As an extension of the tbl_df class, tof_tbls get access to all dplyr and tidyr verbs for free. These can be useful for performing a variety of common operations.
For example, the phenograph object above has two columns (PhenoGraph and Condition) that encode categorical variables as numeric codes. We might be interested in converting the types of these columns into strings to make sure that we don’t accidentally perform any quantitative operations on them later. Thus, dplyr’s useful mutate method can be applied to phenograph to convert those two columns into character vectors.
phenograph <-
    phenograph %>%
    # mutate the input tof_tbl
    mutate(
        PhenoGraph = as.character(PhenoGraph),
        Condition = as.character(Condition)
    )

phenograph %>%
    # use dplyr's select method to show
    # that the columns have been changed
    select(where(is.character))
#> # A tibble: 300 × 3
#> file_name PhenoGraph Condition
#> <chr> <chr> <chr>
#> 1 H1_PhenoGraph_cluster1.fcs 7 7
#> 2 H1_PhenoGraph_cluster1.fcs 6 6
#> 3 H1_PhenoGraph_cluster1.fcs 9 9
#> 4 H1_PhenoGraph_cluster1.fcs 2 2
#> # ℹ 296 more rows
Note that the tof_tbl class is preserved even after these transformations.
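One way to confirm this is to check the class explicitly with base R's inherits after applying the verbs. The sketch below repeats the mutate and select steps on a fresh copy of the data. The name converted is illustrative.

```r
library(tidytof)
library(dplyr)

phenograph <-
    tidytof_example_data("phenograph") %>%
    tof_read_data()

# Apply the same kind of dplyr verbs used above
converted <-
    phenograph %>%
    mutate(
        PhenoGraph = as.character(PhenoGraph),
        Condition = as.character(Condition)
    ) %>%
    select(where(is.character))

# Check whether the result still carries the tof_tbl class
inherits(converted, "tof_tbl")
```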
Importantly, tof_read_data uses an opinionated heuristic to mine different keyword slots of the input FCS file(s) and guess which metals and antigens were used during data acquisition. CSV files, unlike FCS files, do not provide built-in metadata about the columns they contain, so when CSV files are read using tof_read_data, it is recommended to use the panel_info argument to provide the panel manually.
# when csv files are read, the tof_tibble's "panel"
# attribute will be empty by default
tidytof_example_data("phenograph_csv") %>%
    tof_read_data() %>%
    tof_get_panel()
#> # A tibble: 0 × 0
# to add a panel manually, provide it as a tibble
# to tof_read_data
phenograph_panel <-
    phenograph %>%
    tof_get_panel()

tidytof_example_data("phenograph_csv") %>%
    tof_read_data(panel_info = phenograph_panel) %>%
    tof_get_panel()
#> # A tibble: 44 × 2
#> antigens metals
#> <chr> <chr>
#> 1 Time Time
#> 2 Cell_length Cell_length
#> 3 DNA1 Ir191
#> 4 DNA2 Ir193
#> # ℹ 40 more rows
Writing data from a tof_tbl to disk
Users may wish to store CyTOF data as FCS or CSV files after transformation, concatenation, filtering, or other data processing. To write single-cell data from a tof_tbl into FCS or CSV files, use tof_write_data. To illustrate how to use this verb, we use tidytof’s built-in phenograph_data dataset.
data(phenograph_data)
print(phenograph_data)
#> # A tibble: 3,000 × 25
#> sample_name phenograph_cluster cd19 cd11b cd34 cd45 cd123 cd33 cd47
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 H1_PhenoGra… cluster1 -0.168 29.0 3.23 131. -0.609 1.21 13.0
#> 2 H1_PhenoGra… cluster1 1.65 4.83 -0.582 230. 2.53 -0.507 12.9
#> 3 H1_PhenoGra… cluster1 2.79 36.1 5.20 293. -0.265 3.67 27.1
#> 4 H1_PhenoGra… cluster1 0.0816 48.8 0.363 431. 2.04 9.40 41.0
#> # ℹ 2,996 more rows
#> # ℹ 16 more variables: cd7 <dbl>, cd44 <dbl>, cd38 <dbl>, cd3 <dbl>,
#> # cd117 <dbl>, cd64 <dbl>, cd41 <dbl>, pstat3 <dbl>, pstat5 <dbl>,
#> # pampk <dbl>, p4ebp1 <dbl>, ps6 <dbl>, pcreb <dbl>, `pzap70-syk` <dbl>,
#> # prb <dbl>, `perk1-2` <dbl>
# when copying and pasting this code, feel free to change this path
# to wherever you'd like to save your output files
my_path <- file.path("~", "Desktop", "tidytof_vignette_files")
phenograph_data %>%
    tof_write_data(
        group_cols = phenograph_cluster,
        out_path = my_path,
        format = "fcs"
    )
tof_write_data’s trickiest argument is group_cols, the argument used to specify which columns in tof_tibble should be used to group cells (the rows of tof_tibble) into separate FCS or CSV files. Simply put, this argument allows tof_write_data to create a single FCS or CSV file for each unique combination of values in the group_cols columns specified by the user. In the example above, cells are grouped into 3 output FCS files, one for each of the 3 clusters encoded by the phenograph_cluster column in phenograph_data. These files should have the following names (derived from the values in the phenograph_cluster column):
- cluster1.fcs
- cluster2.fcs
- cluster3.fcs
Note that these file names match the distinct values in our group_cols column (phenograph_cluster):
phenograph_data %>%
    distinct(phenograph_cluster)
#> # A tibble: 3 × 1
#> phenograph_cluster
#> <chr>
#> 1 cluster1
#> 2 cluster2
#> 3 cluster3
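To check what was actually written, the output directory can be listed after the call. The sketch below writes to a temporary directory rather than the Desktop path used above, so that it can be run without touching your files. The name out_dir is hypothetical, and the directory is created explicitly in case tof_write_data does not create it.

```r
library(tidytof)
library(dplyr)

data(phenograph_data)

# Hypothetical output location inside the session's temporary directory
out_dir <- file.path(tempdir(), "tidytof_vignette_files")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write one FCS file per value of phenograph_cluster
phenograph_data %>%
    tof_write_data(
        group_cols = phenograph_cluster,
        out_path = out_dir,
        format = "fcs"
    )

# List the files that were written
written_files <- list.files(out_dir)
written_files
```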
However, suppose we wanted to write multiple files for each cluster by breaking cells into two groups: those that express high levels of pstat5 and those that express low levels of pstat5. We can use dplyr::mutate to create a new column in phenograph_data that breaks cells into high- and low-pstat5 expression groups, then add this column to our group_cols specification:
phenograph_data %>%
    # create a variable representing if a cell is above or below
    # the median expression level of pstat5
    mutate(
        expression_group = if_else(pstat5 > median(pstat5), "high", "low")
    ) %>%
    tof_write_data(
        group_cols = c(phenograph_cluster, expression_group),
        out_path = my_path,
        format = "fcs"
    )
This will write 6 files with the following names (derived from the values in phenograph_cluster and expression_group):
- cluster1_low.fcs
- cluster1_high.fcs
- cluster2_low.fcs
- cluster2_high.fcs
- cluster3_low.fcs
- cluster3_high.fcs
As above, note that these file names match the distinct values in our group_cols columns (phenograph_cluster and expression_group):
phenograph_data %>%
    mutate(
        expression_group = if_else(pstat5 > median(pstat5), "high", "low")
    ) %>%
    distinct(phenograph_cluster, expression_group)
#> # A tibble: 6 × 2
#> phenograph_cluster expression_group
#> <chr> <chr>
#> 1 cluster1 low
#> 2 cluster1 high
#> 3 cluster2 low
#> 4 cluster2 high
#> # ℹ 2 more rows
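The relationship between group values and file names can be previewed without writing anything to disk. The sketch below rebuilds the names listed above by pasting the two grouping columns together with an underscore and an “.fcs” extension. That naming pattern is inferred from the file names shown in this vignette, not taken from tof_write_data itself, and the expected_file column is illustrative.

```r
library(tidytof)
library(dplyr)

data(phenograph_data)

expected_files <-
    phenograph_data %>%
    # same high/low split around the median of pstat5 used above
    mutate(
        expression_group = if_else(pstat5 > median(pstat5), "high", "low")
    ) %>%
    # one row per unique combination of the grouping columns
    distinct(phenograph_cluster, expression_group) %>%
    # assumed naming pattern: <cluster>_<group>.fcs
    mutate(
        expected_file = paste0(phenograph_cluster, "_", expression_group, ".fcs")
    )

expected_files
```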
A useful feature of tof_write_data is that it will automatically concatenate cells into single FCS or CSV files based on the specified group_cols regardless of how many unique files those cells came from. This allows for easy concatenation of FCS or CSV files containing data from a single sample acquired over multiple CyTOF runs, for example.
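A minimal sketch of this concatenation: adding a constant grouping column places every cell in the same group, so all cells are written together regardless of which sample they came from. The column name pooled, its value "all_cells", and the temporary directory pool_dir are all hypothetical choices for illustration.

```r
library(tidytof)
library(dplyr)

data(phenograph_data)

# Hypothetical output location inside the session's temporary directory
pool_dir <- file.path(tempdir(), "tidytof_pooled_files")
dir.create(pool_dir, showWarnings = FALSE, recursive = TRUE)

phenograph_data %>%
    # a constant column puts every cell into a single group
    mutate(pooled = "all_cells") %>%
    tof_write_data(
        group_cols = pooled,
        out_path = pool_dir,
        format = "fcs"
    )

# Read the output back and compare the number of cells
pooled_back <- tof_read_data(pool_dir)
c(original = nrow(phenograph_data), written = nrow(pooled_back))
```

If the two counts agree, every cell from the original samples ended up in the pooled output.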
Session info
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 tidytof_0.99.8
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3
#> [4] matrixStats_1.3.0 compiler_4.4.1 systemfonts_1.1.0
#> [7] vctrs_0.6.5 stringr_1.5.1 crayon_1.5.3
#> [10] pkgconfig_2.0.3 shape_1.4.6.1 fastmap_1.2.0
#> [13] ggraph_2.2.1 utf8_1.2.4 rmarkdown_2.28
#> [16] prodlim_2024.06.25 tzdb_0.4.0 ragg_1.3.2
#> [19] bit_4.0.5 purrr_1.0.2 xfun_0.47
#> [22] glmnet_4.1-8 cachem_1.1.0 jsonlite_1.8.8
#> [25] recipes_1.1.0 tweenr_2.0.3 parallel_4.4.1
#> [28] R6_2.5.1 bslib_0.8.0 stringi_1.8.4
#> [31] parallelly_1.38.0 rpart_4.1.23 lubridate_1.9.3
#> [34] jquerylib_0.1.4 Rcpp_1.0.13 iterators_1.0.14
#> [37] knitr_1.48 future.apply_1.11.2 readr_2.1.5
#> [40] flowCore_2.16.0 Matrix_1.7-0 splines_4.4.1
#> [43] nnet_7.3-19 igraph_2.0.3 timechange_0.3.0
#> [46] tidyselect_1.2.1 yaml_2.3.10 viridis_0.6.5
#> [49] timeDate_4032.109 doParallel_1.0.17 codetools_0.2-20
#> [52] listenv_0.9.1 lattice_0.22-6 tibble_3.2.1
#> [55] Biobase_2.64.0 withr_3.0.1 evaluate_0.24.0
#> [58] future_1.34.0 desc_1.4.3 survival_3.6-4
#> [61] polyclip_1.10-7 pillar_1.9.0 foreach_1.5.2
#> [64] stats4_4.4.1 generics_0.1.3 vroom_1.6.5
#> [67] RcppHNSW_0.6.0 S4Vectors_0.42.1 hms_1.1.3
#> [70] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
#> [73] globals_0.16.3 class_7.3-22 glue_1.7.0
#> [76] tools_4.4.1 data.table_1.15.4 gower_1.0.1
#> [79] fs_1.6.4 graphlayouts_1.1.1 tidygraph_1.3.1
#> [82] grid_4.4.1 yardstick_1.3.1 tidyr_1.3.1
#> [85] RProtoBufLib_2.16.0 ipred_0.9-15 colorspace_2.1-1
#> [88] ggforce_0.4.2 cli_3.6.3 textshaping_0.4.0
#> [91] fansi_1.0.6 cytolib_2.16.0 viridisLite_0.4.2
#> [94] lava_1.8.0 gtable_0.3.5 sass_0.4.9
#> [97] digest_0.6.37 BiocGenerics_0.50.0 ggrepel_0.9.5
#> [100] htmlwidgets_1.6.4 farver_2.1.2 memoise_2.0.1
#> [103] htmltools_0.5.8.1 pkgdown_2.1.0 lifecycle_1.0.4
#> [106] hardhat_1.4.0 bit64_4.0.5 MASS_7.3-60.2