Python quick start
TileDB-SOMA is available on PyPI and Conda, and can be installed via pip or mamba as indicated below. Full installation instructions can be found here.
python -m pip install tiledbsoma
mamba install -c conda-forge -c tiledb tiledbsoma-py
If you see illegal instruction errors when running on older architectures (e.g. Opteron, non-AVX2), the cause is that the pre-compiled binaries available from Conda or PyPI are not built for every processor variant. You can use
git clone https://github.com/single-cell-data/TileDB-SOMA.git
pip install -v -e TileDB-SOMA/apis/python
to compile locally. You'll need cmake on your system.
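Before deciding between the pre-compiled packages and a local compile, it can help to check whether the CPU advertises AVX2. The following is a minimal sketch for Linux only, assuming the kernel exposes CPU feature flags in /proc/cpuinfo; the helper names `cpu_flags` and `supports_avx2` are hypothetical and not part of tiledbsoma. A missing flag is only a hint that the pre-compiled binaries may fail, not a definitive diagnosis.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels list features under "flags"; some ARM kernels use "Features".
        if sep and key.strip().lower() in {"flags", "features"}:
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_text):
    """Return True if the text lists the avx2 feature flag."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("AVX2 advertised:", supports_avx2(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo on this system; check CPU features another way.")
```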
https://tiledbsoma.readthedocs.io/en/latest/python-api.html
SOMA objects can be created with their respective create() methods and then need to be populated in specific ways depending on their types.
However, a SOMAExperiment can be easily created from an anndata object or a *.h5ad file. Here, one is created from a *.h5ad file.
import tiledbsoma.io
# Create and write a SOMA Experiment, source data https://github.com/chanzuckerberg/cellxgene/raw/main/example-dataset/pbmc3k.h5ad
pbmc3k_uri = tiledbsoma.io.from_h5ad("./pbmc3k", input_path = "pbmc3k.h5ad", measurement_name = "RNA")
SOMA objects can be opened using tiledbsoma.open().
The contents of DataFrame, SparseNDArray and DenseNDArray can be accessed with their respective read() methods. For DataFrame and SparseNDArray the method returns an iterator useful for larger-than-memory operations.
For example, you can open the SOMAExperiment created above and then read the contents of obs, which is a SOMADataFrame.
In addition, this example shows how you can query for observations with louvain values of 'Megakaryocytes' and 'CD4 T cells', and n_genes greater than 500.
import tiledbsoma

with tiledbsoma.open(pbmc3k_uri) as pbmc3k_soma:
    pbmc3k_obs_slice = pbmc3k_soma.obs.read(
        value_filter="n_genes >500 and louvain in ['Megakaryocytes', 'CD4 T cells']"
    )

    # Concatenate iterator to pyarrow.Table
    pbmc3k_obs_slice.concat()
The result is a pyarrow.Table containing a slice based on the specified filters.
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string
----
soma_joinid: [[0,2,8,11,12,...,2617,2621,2626,2631,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGATCAGC-1","AAACGCTGTAGCCA-1","AAACTTGATCCAGA-1","AAAGAGACGAGATA-1",...,"TTGTAGCTAGCTCA-1","TTTAGCTGATACCG-1","TTTCACGAGGTTCA-1","TTTCCAGAGGTGAG-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1131,533,751,866,...,933,887,721,873,724]]
percent_mito: [[0.030177759,0.008897362,0.011764706,0.010887772,0.010788382,...,0.02224871,0.022875817,0.013261297,0.0068587107,0.008064516]]
n_counts: [[2419,3147,1275,2388,2410,...,2517,2754,2036,2187,1984]]
louvain: [["CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells",...,"CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells"]]
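The value_filter passed to read() above is an ordinary string, so it can be assembled from variables rather than typed by hand. The sketch below builds an expression of the same form; `build_value_filter` is a hypothetical helper, not a tiledbsoma function, and the column names n_genes and louvain are taken from the example above. It assumes labels are simple strings; labels containing quotes or backslashes have not been considered here.

```python
def build_value_filter(min_genes, labels):
    """Build a filter string: n_genes above a threshold and louvain in a label list."""
    label_list = [str(label) for label in labels]
    # repr() of a list of str gives a quoted, comma-separated list,
    # e.g. ['Megakaryocytes', 'CD4 T cells'].
    return f"n_genes > {int(min_genes)} and louvain in {label_list!r}"


value_filter = build_value_filter(500, ["Megakaryocytes", "CD4 T cells"])
print(value_filter)
# n_genes > 500 and louvain in ['Megakaryocytes', 'CD4 T cells']
```

The resulting string can then be supplied as the value_filter argument in the obs.read() call shown earlier.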
As stated above, the read() methods of DataFrame and SparseNDArray return an iterator. The batch size can be specified with the soma.init_buffer_bytes config option; in this example it is set to 100 bytes:
context = tiledbsoma.options.SOMATileDBContext()
context = context.replace(tiledb_config = {"soma.init_buffer_bytes": 100})

with tiledbsoma.open(pbmc3k_uri, context = context) as pbmc3k_soma:
    pbmc3k_obs = pbmc3k_soma.obs.read()

    counter = 1
    for pbmc3k_obs_chunk in pbmc3k_obs:
        # Perform operations
        # pbmc3k_obs_chunk is a pyarrow.Table
        counter += 1

print(counter)
Because the counter starts at 1 and is incremented once per chunk, the printed value is one more than the number of iterations performed:
441
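The chunked pattern above, where each batch is processed and then discarded, can be illustrated without tiledbsoma. The following is a generic sketch in plain Python: `iter_chunks` and `count_and_sum` are hypothetical names, and the chunk size is measured in items rather than bytes, so it does not model how soma.init_buffer_bytes actually sizes batches. It only shows that a smaller batch size means more iterations while the aggregate result stays the same. The counter here starts at 0, so it equals the number of chunks.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_and_sum(rows, chunk_size):
    """Process rows chunk by chunk; return (number of chunks, sum of all items)."""
    n_chunks = 0
    total = 0
    for chunk in iter_chunks(rows, chunk_size):
        n_chunks += 1        # one iteration per chunk
        total += sum(chunk)  # per-chunk work; only one chunk is held at a time
    return n_chunks, total


print(count_and_sum(range(10), 3))   # (4, 45)
print(count_and_sum(range(10), 10))  # (1, 45)
```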