thor.fineST
- class thor.fineST(image_path, name, spot_adata_path=None, st_dir=None, cell_features_list=None, cell_features_csv_path=None, genes_path=None, save_dir=None, recipe='gene', **kwargs)[source]
Bases: object
Class for in silico cell gene expression inference
- Parameters:
image_path (str) – Path to the whole slide image that is aligned to the spatial transcriptomics data.
name (str) – Name of the sample.
spot_adata_path (str, optional) – Path to the processed spatial transcriptomics data (e.g., from Visium sequencing data) in the .h5ad format. The expression array (.X) and the spot coordinates (.obsm["spatial"]) are required, and .X is expected to be log-normalized. One of spot_adata_path or st_dir is needed. If spot_adata_path is provided, st_dir is ignored.
st_dir (str, optional) – Path to the SpaceRanger output directory, where the count matrix and the spatial directory are located.
cell_features_csv_path (str, optional) – Path to the CSV file that stores the cell features. The first two columns are expected to be exactly the nuclei positions “x” and “y”.
cell_features_list (list or None, optional) – List of features to be used for generating the cell-cell graph. The first two are expected to be exactly the nuclei positions “x” and “y”. By default, if no external features are provided, the features ["x", "y", "mean_gray", "std_gray", "entropy_img", "mean_r", "mean_g", "mean_b", "std_r", "std_g", "std_b"] are used.
genes_path (str, optional) – Path to a headerless, single-column file listing the genes to be included. The gene names or gene IDs should be consistent with self.adata.var_names. If None, the genes will be the highly variable genes, or can be set later with self.set_genes_for_prediction.
save_dir (str or None, optional) – Path to the directory where the fineST prediction results are saved.
recipe (str, optional) – Mode for predicting the gene expression. Valid options are "gene", "reduced", and "mix".
**kwargs (dict, optional) – Keyword arguments for any additional attributes to be set on the instance. This allows a saved JSON file to be loaded later to create a new instance of the class.
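As a minimal sketch of the two file layouts described above (not thor's own loader), the following reads a cell features table and a gene list from in-memory stand-ins. The feature values and gene symbols are made up for illustration, and using pandas to inspect the files is an assumption rather than something this page prescribes.

```python
import io

import pandas as pd

# Stand-in for a cell features CSV: the first two columns are "x" and "y".
features_csv = io.StringIO(
    "x,y,mean_gray,std_gray\n"
    "10.5,20.0,0.61,0.08\n"
    "11.0,22.5,0.57,0.11\n"
)
features = pd.read_csv(features_csv)

# Check the layout requirement stated for cell_features_csv_path.
first_two_ok = list(features.columns[:2]) == ["x", "y"]

# Stand-in for a genes file: one gene per line, no header row.
genes_file = io.StringIO("CD3E\nMS4A1\nEPCAM\n")
genes = pd.read_csv(genes_file, header=None, dtype=str)[0].tolist()
```

A quick check like this before constructing the class can catch a features file whose leading columns are not the nuclei positions.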
Methods
Create a deep copy of the instance.
Get a reduced set of genes which were used to train the VAE model.
Load a fineST instance from a JSON file.
load_generate_model
Load the user-input genes to be used for prediction.
Load parameters from a JSON file.
Load the predicted gene expression data and create an anndata.AnnData object.
Load a pre-trained VAE model.
Predict gene expression using Markov graph diffusion.
Prepare the input for the fineST estimation.
Prepare the recipe for gene expression prediction.
Perform sanity checks on the input data and parameters.
Save the current state of the instance.
Set the path to the CSV file containing cell features.
Set the list of cell features to be used for graph construction.
Set genes to be used for prediction.
Set the parameters for the fineST estimation.
Train a VAE model for the spot-level transcriptome data.
Visualize the cell-cell network.
Write an AnnData object to disk in the results directory.
Write the current parameters to a JSON file.
- copy()[source]
Create a deep copy of the instance.
- Returns:
A new instance that is a deep copy of the current instance.
- Return type:
- get_reduced_genes(keep=0.9, min_mean_expression=0.5)[source]
Get a reduced set of genes from those used to train the VAE model. Not all genes used for VAE training are reconstructed equally faithfully, so only the genes with high reconstruction quality (measured by cosine similarity with the input gene expression) are kept. Note that the genes used for VAE training (self.adata.var.used_for_vae) are not the same as the genes used for thor prediction in reduced mode (self.adata.var.used_for_reduced), which are a subset of them.
- Parameters:
keep (float, optional) – Fraction of genes to keep, based on their importance in the VAE model, for thor prediction in reduced mode. The genes are ranked according to the VAE reconstruction quality (measured by cosine similarity with the input gene expression). Default is 0.9.
min_mean_expression (float, optional) – Minimum mean expression for genes to be considered. Note that the expression values are log-transformed normalized expression. Default is 0.5.
- Returns:
The selected genes are marked in self.adata.var.used_for_reduced.
- Return type:
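The selection rule described above can be sketched with NumPy as follows. This is an illustration under assumptions, not thor's implementation: the function name select_reduced_genes is hypothetical, the reconstruction X_rec is supplied by hand instead of coming from a VAE, and filtering by mean expression before ranking is an assumed order of operations.

```python
import numpy as np


def select_reduced_genes(X, X_rec, gene_names, keep=0.9, min_mean_expression=0.5):
    """Keep the best-reconstructed fraction of sufficiently expressed genes.

    X and X_rec are (spots x genes) arrays of input and reconstructed expression.
    """
    X = np.asarray(X, dtype=float)
    X_rec = np.asarray(X_rec, dtype=float)

    # Cosine similarity between input and reconstruction, one value per gene.
    num = (X * X_rec).sum(axis=0)
    den = np.linalg.norm(X, axis=0) * np.linalg.norm(X_rec, axis=0)
    cos = np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    # Assumed order: drop lowly expressed genes first, then rank the rest.
    candidates = np.flatnonzero(X.mean(axis=0) >= min_mean_expression)
    n_keep = int(np.ceil(keep * candidates.size))
    ranked = candidates[np.argsort(-cos[candidates], kind="stable")]
    return [gene_names[i] for i in sorted(ranked[:n_keep])]


X = np.array([[1.0, 2.0, 0.1, 3.0],
              [2.0, 2.0, 0.2, 1.0],
              [3.0, 2.0, 0.1, 2.0]])
X_rec = np.array([[1.0, 2.0, 0.1, 1.0],
                  [2.0, 2.0, 0.2, 3.0],
                  [3.0, 2.0, 0.1, 2.0]])
selected = select_reduced_genes(X, X_rec, ["A", "B", "C", "D"], keep=0.5)
```

In this toy input, gene C falls below the mean expression threshold and gene D is reconstructed poorly, so only A and B are selected.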
- load_params(json_path)[source]
Load parameters from a JSON file. The loaded parameters will be merged with the current parameters in self.run_params and self.graph_params.
- Parameters:
json_path (str) – Path to the JSON file containing the parameters.
- load_result(file_name, layer_name=None)[source]
Load the predicted gene expression data and create an anndata.AnnData object.
- Parameters:
- Returns:
The loaded anndata.AnnData object.
- Return type:
- load_vae_model(model_path=None)[source]
Load a pre-trained VAE model.
- Parameters:
model_path (str or None, optional) – Path to the directory containing the VAE model. The model should be saved in the encoder and decoder subdirectories, with the filenames {self.name}_VAE_encoder.h5 and {self.name}_VAE_decoder.h5.
- Returns:
The VAE model is loaded into the instance as self.generate and the model path is stored in self.model_path.
- Return type:
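A small sketch of the file layout implied by the model_path description is shown below. The helper expected_vae_files is hypothetical, and it assumes the subdirectories are literally named "encoder" and "decoder", which the description suggests but does not state outright.

```python
from pathlib import PurePosixPath


def expected_vae_files(model_path, name):
    """Return the encoder and decoder file paths expected under model_path."""
    root = PurePosixPath(model_path)
    return (
        root / "encoder" / f"{name}_VAE_encoder.h5",  # assumed subdirectory name
        root / "decoder" / f"{name}_VAE_decoder.h5",  # assumed subdirectory name
    )


# "models/demo" and "demo" are placeholder values for illustration.
encoder_file, decoder_file = expected_vae_files("models/demo", "demo")
```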
- predict_gene_expression(**kwargs)[source]
Predict gene expression using Markov graph diffusion.
This method performs the following steps:
1. Updates parameters using set_params()
2. Prepares the recipe based on the selected mode
3. Optionally performs burn-in if transcriptome effects are included
4. Initializes the cell graph and transition matrix
5. Runs the Markov graph diffusion process
6. Saves results and cleans up temporary files
- Parameters:
**kwargs (dict) – Same parameters as accepted by set_params().
- Returns:
AnnData object with predicted gene expression for cells.
- Return type:
See also
set_params()
For detailed parameter descriptions.
thor.markov_graph_diffusion.estimate_expression_markov_graph_diffusion()
For the underlying implementation.
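One plausible way to chain the documented methods is sketched below. The call order is an assumption drawn from the method descriptions on this page, and StubFineST is a hypothetical stand-in that only records calls so the sketch runs without thor or any data files; the file names are placeholders.

```python
def run_prediction(finest_cls, image_path, name, spot_adata_path, **params):
    """Call the documented methods in one plausible order and return the result."""
    model = finest_cls(image_path, name, spot_adata_path=spot_adata_path)
    model.prepare_input(mapping_margin=10)
    model.set_genes_for_prediction(genes_selection_key="highly_variable")
    # Keyword arguments are forwarded, as predict_gene_expression accepts
    # the same parameters as set_params().
    return model.predict_gene_expression(**params)


class StubFineST:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self, image_path, name, spot_adata_path=None):
        self.calls = ["__init__"]

    def prepare_input(self, mapping_margin=10, spot_identifier="spot_barcodes"):
        self.calls.append("prepare_input")

    def set_genes_for_prediction(self, genes_selection_key="highly_variable"):
        self.calls.append("set_genes_for_prediction")

    def predict_gene_expression(self, **kwargs):
        self.calls.append("predict_gene_expression")
        return {"params": kwargs, "calls": list(self.calls)}


result = run_prediction(StubFineST, "slide.tif", "demo", "spots.h5ad", n_iter=20)
```

With thor installed, the real class would be passed in place of the stub, together with paths to an actual image and .h5ad file.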
- prepare_input(mapping_margin=10, spot_identifier='spot_barcodes')[source]
Prepare the input for the fineST estimation.
First, the cell-wise adata is generated from the cell features and the spot adata. In this step, the segmented cells are read from self.cell_features_csv_path, and segmentation outliers are removed according to the distance between a cell and its nearest neighbor. Second, the spot gene expression is mapped to the aligned nearest cells. Lastly, the spot heterogeneity is computed using the image features for later construction of the cell-cell graph and the transition matrix.
- Parameters:
mapping_margin (int or float, optional) – Margin for mapping the spot gene expression to the cells. Default is 10, which attempts to map cells that are within 10 times the spot radius of any spot (so almost all identified cells are mapped to their nearest spots). Decrease this number to eliminate isolated cells.
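The two geometric steps described above, dropping isolated cells and mapping cells to their nearest spot within a margin, can be sketched with NumPy. This is a toy illustration, not thor's code: the function names, the fixed nearest-neighbor cutoff max_nn_dist, and the use of -1 for unmapped cells are all assumptions.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def drop_isolated_cells(cells, max_nn_dist):
    """Remove cells whose nearest neighbor is farther than max_nn_dist."""
    d = pairwise_dist(cells, cells)
    np.fill_diagonal(d, np.inf)  # ignore each cell's distance to itself
    return cells[d.min(axis=1) <= max_nn_dist]


def map_cells_to_spots(cells, spots, spot_radius, mapping_margin=10):
    """Index of the nearest spot per cell, or -1 if it lies outside the margin."""
    d = pairwise_dist(cells, spots)
    nearest = d.argmin(axis=1)
    within = d[np.arange(len(cells)), nearest] <= mapping_margin * spot_radius
    return np.where(within, nearest, -1)


cells = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 0.0], [10.0, 0.0], [100.0, 100.0]])
spots = np.array([[0.0, 0.0], [10.0, 0.0]])

kept = drop_isolated_cells(cells, max_nn_dist=2.0)
wide = map_cells_to_spots(kept, spots, spot_radius=1.0, mapping_margin=2)
narrow = map_cells_to_spots(kept, spots, spot_radius=1.0, mapping_margin=0.5)
```

Lowering mapping_margin from 2 to 0.5 leaves the two cells that sit one unit away from a spot unmapped, which mirrors the advice to decrease the margin when isolated cells should be excluded.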
- prepare_recipe()[source]
Prepare the recipe for gene expression prediction.
This function sets up the appropriate genes and parameters based on the selected recipe. Supported recipes are:
- “gene”: use all the user-provided genes for prediction. The user-provided genes should be marked in self.adata.var.used_for_prediction.
- “reduced”: use the VAE genes for prediction. The VAE genes should be marked in self.adata.var.used_for_vae and are used for prediction, ignoring self.genes.
- “mix”: use both the VAE genes and the rest of the user-provided genes for prediction. The VAE genes should be marked in self.adata.var.used_for_vae.
- Returns:
The recipe-specific settings are applied to set the self.adata.var columns for gene selection.
- Return type:
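How the three recipes translate into gene sets can be sketched on a small stand-in for adata.var. The function genes_for_recipe is hypothetical and follows only the recipe descriptions above; the table contents are invented for illustration.

```python
import pandas as pd


def genes_for_recipe(var, recipe):
    """Return the gene names selected by a recipe from boolean var columns."""
    if recipe == "gene":
        mask = var["used_for_prediction"]
    elif recipe == "reduced":
        mask = var["used_for_vae"]
    elif recipe == "mix":
        # VAE genes plus the remaining user-provided genes.
        mask = var["used_for_vae"] | var["used_for_prediction"]
    else:
        raise ValueError(f"Unknown recipe: {recipe!r}")
    return list(var.index[mask])


# Toy stand-in for adata.var with four genes.
var = pd.DataFrame(
    {
        "used_for_prediction": [True, True, False, False],
        "used_for_vae": [False, True, True, False],
    },
    index=["G1", "G2", "G3", "G4"],
)
gene_mode = genes_for_recipe(var, "gene")
reduced_mode = genes_for_recipe(var, "reduced")
mix_mode = genes_for_recipe(var, "mix")
```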
- sanity_check()[source]
Perform sanity checks on the input data and parameters.
This function verifies that all necessary attributes and parameters are set correctly before running the fineST estimation.
- save(exclude=['generate', 'adata', 'conn_csr_matrix'])[source]
Save the current state of the instance. The saved JSON file can be used to create a new instance of the class.
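The idea of saving an instance to JSON while excluding heavy attributes, then rebuilding an instance from the saved file, can be sketched as follows. ToySample and state_to_json are hypothetical names used only for this illustration; thor's actual serialization may differ.

```python
import json


class ToySample:
    """Toy class whose extra keyword arguments become attributes."""

    def __init__(self, image_path, name, **kwargs):
        self.image_path = image_path
        self.name = name
        for key, value in kwargs.items():
            setattr(self, key, value)


def state_to_json(obj, exclude=("generate", "adata", "conn_csr_matrix")):
    """Serialize the instance attributes, skipping the excluded ones."""
    state = {k: v for k, v in vars(obj).items() if k not in exclude}
    return json.dumps(state)


sample = ToySample("slide.tif", "demo", save_dir="out")
sample.adata = object()  # not JSON-serializable, so it must be excluded

text = state_to_json(sample)
restored = ToySample(**json.loads(text))
```

Accepting arbitrary keyword arguments in the constructor is what lets the saved attributes be passed straight back in, as the **kwargs description of the class indicates.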
- set_cell_features_csv_path(cell_features_csv_path=None)[source]
Set the path to the CSV file containing cell features.
- Parameters:
cell_features_csv_path (str or None, optional) – Path to the CSV file containing cell features. If None, the cell features CSV file will be obtained from the WSI, which includes nuclei segmentation and feature extraction.
- Returns:
The file path is stored in self.cell_features_csv_path.
- Return type:
- set_cell_features_list(cell_features_list=None)[source]
Set the list of cell features to be used for graph construction.
- Parameters:
cell_features_list (list or None, optional) – List of features to be used for generating the cell-cell graph. If None, default features will be used: [“x”, “y”, “mean_gray”, “std_gray”, “entropy_img”, “mean_r”, “mean_g”, “mean_b”, “std_r”, “std_g”, “std_b”].
- Returns:
The feature names are stored in the list self.cell_features_list.
- Return type:
- set_genes_for_prediction(genes_selection_key='highly_variable')[source]
Set genes to be used for prediction.
- Parameters:
genes_selection_key (str, optional) – Key for gene selection in self.adata.var. Default: “highly_variable”. Valid options:
- “highly_variable”: Selects highly variable genes
- “all”: Selects all genes (not recommended)
- None: Uses genes specified in self.genes
- Any key in self.adata.var: Uses that key for selection
- Returns:
The selected genes are marked in self.adata.var.used_for_prediction.
- Return type:
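The selection options listed above can be sketched on a toy stand-in for adata.var. The function mark_genes_for_prediction is hypothetical and mirrors only the documented options; how thor handles missing keys or empty gene lists is not covered here.

```python
import pandas as pd


def mark_genes_for_prediction(var, genes_selection_key="highly_variable", genes=None):
    """Return a copy of var with a boolean used_for_prediction column."""
    var = var.copy()
    if genes_selection_key == "all":
        mask = pd.Series(True, index=var.index)
    elif genes_selection_key is None:
        # Fall back to an explicit gene list (stand-in for self.genes).
        mask = pd.Series(var.index.isin(genes or []), index=var.index)
    else:
        # Any boolean column of var can serve as the selection key.
        mask = var[genes_selection_key].astype(bool)
    var["used_for_prediction"] = mask
    return var


var = pd.DataFrame({"highly_variable": [True, False, True]}, index=["G1", "G2", "G3"])
by_hvg = mark_genes_for_prediction(var)
by_list = mark_genes_for_prediction(var, genes_selection_key=None, genes=["G2"])
by_all = mark_genes_for_prediction(var, genes_selection_key="all")
```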
- set_params(**kwargs)[source]
Set the parameters for the fineST estimation.
This method allows you to configure both graph construction and Markov diffusion parameters. All parameters are optional and will update the default values.
Graph Construction Parameters
- n_neighbors (int, optional) – Number of neighbors for cell-cell graph construction. Default is 5.
- obs_keys (list or None, optional) – List of observation keys to use for graph construction. If None, uses default features.
- reduced_dimension_transcriptome_obsm_key (str, optional) – Key in obsm for reduced dimension representation. Default is “X_pca”.
- reduced_dimension_transcriptome_obsm_dims (int, optional) – Number of dimensions to use from the reduced representation. Default is 2.
- geom_morph_ratio (float, optional) – Ratio between geometric and morphological distances. Default is 1.
- geom_constraint (float, optional) – Constraint on geometric distances. Default is 0.
- snn_threshold (float, optional) – Threshold for shared nearest neighbor graph. Default is 0.1.
- node_features_obs_list (list, optional) – List of node features to use. Default is [“spot_heterogeneity”].
- balance_cell_quality (bool, optional) – Whether to balance cell quality. Default is False.
- bcq_IQR (tuple, optional) – Interquartile range for balancing cell quality. Default is (0.15, 0.85).
Transition Matrix Parameters
- preferential_flow (bool, optional) – Whether to use preferential flow in the transition matrix. Default is True.
- weigh_cells (bool, optional) – Whether to weigh cells (nodes) by node features. The idea is to give more weight to the cells with higher quality (e.g., lower heterogeneity with surrounding cells). Default is True.
- smoothing_scale (float, optional) – Scale for smoothing when constructing the transition matrix. The lower the scale, the more self-transition (i.e., the more likely a cell is to stay in the same state). Default is 0.8.
The diffusion transition matrix is defined as:
\[T = I - \lambda \cdot K\]
where \(I\) is the identity matrix, \(K\) is the connectivity matrix, and \(\lambda\) is the smoothing scale.
- inflation_percentage (float or None, optional) – Percentage for the reverse diffusion scale relative to the forward diffusion scale smoothing_scale. Default is None (no reverse diffusion). When enabled, the reverse diffusion is performed right after the forward diffusion in every iteration.
The reverse diffusion transition matrix is defined as:
\[T_{rev} = I - \mu \cdot K\]
where \(I\) is the identity matrix, \(K\) is the connectivity matrix, and \(\mu\) is the reverse diffusion scale, \(\mu = -\lambda \cdot (1 + \text{inflation\_percentage} \cdot 100)\).
- conn_csr_matrix (scipy.sparse.csr_matrix or None, optional) – Pre-computed connectivity matrix. Default is None.
If provided, the connectivity matrix will be used and the other parameters will be ignored.
If None, the connectivity matrix will be computed from the cell-cell graph if it does not exist in adata.obsp; otherwise the existing connectivity matrix will be used.
If “force”, the connectivity matrix will be computed regardless of whether it already exists.
Markov Diffusion Parameters
- initialize (bool, optional) – Whether to initialize graph and transition matrix. Default is True.
- burn_in_steps (int, optional) – Number of steps for burn-in period. Default is 5.
- layer (str or None, optional) – Layer in AnnData to use for gene expression. Default is None (use .X).
- is_rawCount (bool, optional) – Whether the input data is raw counts. Default is False.
- regulate_expression_mean (bool, optional) – Whether to regulate expression mean. Default is False.
- stochastic_expression_neighbors_level (str, optional) – Level for stochastic expression neighbors (“spot” or “cell”). Default is “spot”.
- n_iter (int, optional) – Number of iterations for Markov diffusion. Default is 20.
- conn_key (str, optional) – Key for connectivity matrix in .obsp. Default is “snn”.
- write_freq (int, optional) – Frequency of writing results to disk. Default is 10.
- out_prefix (str, optional) – Prefix for output files. Default is “y”.
- sample_predicted_expression_fluctuation_scale (float, optional) – Scale for fluctuation in predicted expression. Default is 1.
- smooth_predicted_expression_steps (int, optional) – Number of steps for smoothing predicted expression. Default is 0.
- save_chain (bool, optional) – Whether to save the MCMC chain. Default is False.
- n_jobs (int, optional) – Number of parallel jobs to run. Default is 1.
- adjust_cell_network_by_transcriptome_scale (float, optional) – Scale for adjusting cell network by transcriptome. Default is 0.
See also
predict_gene_expression()
For using these parameters in prediction.
thor.markov_graph_diffusion.markov_graph_diffusion_initialize()
For graph construction details.
thor.markov_graph_diffusion.estimate_expression_markov_graph_diffusion()
For Markov diffusion details.
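The forward and reverse transition matrices from the Transition Matrix Parameters above can be illustrated on a three-cell toy graph. This is a sketch under assumptions: the parameter list only calls K "the connectivity matrix", so the Laplacian-like choice K = I - W (with W the row-normalized adjacency) is an assumption made here so that rows of T sum to 1, and the reverse scale mu is set by hand to a hypothetical value instead of being derived from inflation_percentage.

```python
import numpy as np

# Toy path graph with three cells: 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W = A / A.sum(axis=1, keepdims=True)  # row-normalized neighbor weights
K = np.eye(3) - W                     # assumption: Laplacian-like K

lam = 0.8                             # plays the role of smoothing_scale
T = np.eye(3) - lam * K               # forward step: T = I - lambda * K

mu = -lam * 1.05                      # hypothetical reverse scale (negative)
T_rev = np.eye(3) - mu * K            # reverse step: T_rev = I - mu * K

# One forward step, then one reverse step, on a signal held by cell 0 only.
x = np.array([1.0, 0.0, 0.0])
x_forward = T @ x
x_both = T_rev @ x_forward
```

With this choice of K, each row of T sums to 1 and its diagonal equals 1 - lam, which is consistent with the remark that a lower smoothing scale means more self-transition.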
- vae_training(vae_genes_set=None, min_mean_expression=0.1, **kwargs)[source]
Train a VAE model for the spot-level transcriptome data.
- Parameters:
vae_genes_set (set or None, optional) – Set of genes to be used for VAE training. If None, all the genes marked in adata.var.used_for_prediction (which are specified in prepare_input()) with mean expression > min_mean_expression will be used.
min_mean_expression (float, optional) – Minimum mean expression for the genes to be used for VAE training.
**kwargs (dict) – Keyword arguments for the thor.VAE.train_vae() function.
- Return type:
See also
thor.VAE.train_vae()
For detailed parameter descriptions and usage.
- visualize_cell_network(**kwargs)[source]
Visualize the cell-cell network.
This function internally calls thor.plotting.graph.plot_cell_graph() to create the visualization.
- Parameters:
**kwargs (dict) – Additional parameters for the visualization. See thor.plotting.graph.plot_cell_graph() for available options.
- Returns:
The network visualization is displayed.
- Return type:
See also
thor.plotting.graph.plot_cell_graph()
For detailed parameter descriptions and usage.
- write_adata(file_name, ad)[source]
Write an AnnData object to disk in the results directory.
- Parameters:
file_name (str) – Relative path to the file to write. The file will be saved in self.save_dir.
ad (anndata.AnnData) – Cell-wise gene expression to save.