thor.fineST

class thor.fineST(image_path, name, spot_adata_path=None, st_dir=None, cell_features_list=None, cell_features_csv_path=None, genes_path=None, save_dir=None, recipe='gene', **kwargs)

Bases: object

Class for in silico inference of cell-level gene expression.

Parameters:
  • image_path (str) – Path to the whole slide image, which is aligned to the spatial transcriptomics data.

  • name (str) – Name of the sample.

  • spot_adata_path (str, optional) –

    Path to the processed spatial transcriptomics data (e.g., from the Visium sequencing data) in the .h5ad format.

    The expression array (.X) and spot coordinates (.obsm["spatial"]) are required, and .X is expected to be log-normalized.

    One of spot_adata_path or st_dir is required. If spot_adata_path is provided, st_dir is ignored.

  • st_dir (str, optional) – Path to the SpaceRanger output directory, which contains the count matrix and the spatial directory.

  • cell_features_csv_path (str, optional) – Path to the CSV file that stores the cell features. The first two columns are expected (exactly) to be the nuclei positions “x” and “y”.

  • cell_features_list (list or None, optional) –

    List of features to be used for generating the cell-cell graph.

    The first two are expected (exactly) to be the nuclei positions “x” and “y”.

    By default, if no external features are provided, those features ["x", "y", "mean_gray", "std_gray", "entropy_img", "mean_r", "mean_g", "mean_b", "std_r", "std_g", "std_b"] are used.

  • genes_path (str, optional) –

    Path to a single-column file, without a header, listing the genes to include.

    The gene names or gene IDs should be consistent with self.adata.var_names. If None, the highly variable genes are used unless a selection is made later with self.set_genes_for_prediction.

  • save_dir (str or None, optional) – Path to the directory where the fineST prediction results are saved.

  • recipe (str, optional) – Specifies the mode for predicting the gene expression. Valid options are: ("gene", "reduced", "mix").

  • **kwargs (dict, optional) – Keyword arguments for any additional attributes to be set for the class. This allows future loading of the saved json file to create a new instance of the class.
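The input-selection rule in the parameter list (one of spot_adata_path or st_dir is required, and spot_adata_path takes precedence) can be sketched as a small pre-flight check. This is only an illustration: the helper `resolve_st_source` is hypothetical and not part of thor, and all file paths are placeholders.

```python
def resolve_st_source(spot_adata_path=None, st_dir=None):
    """Report which spatial transcriptomics input would be used.

    Hypothetical helper for illustration; not part of the thor API.
    """
    if spot_adata_path is not None:
        # spot_adata_path wins; st_dir is ignored when both are given.
        return ("spot_adata_path", spot_adata_path)
    if st_dir is not None:
        return ("st_dir", st_dir)
    raise ValueError("Provide one of spot_adata_path or st_dir.")


# Placeholder paths; substitute your own files.
ctor_kwargs = dict(
    image_path="sample_HE.tif",
    name="sample1",
    spot_adata_path="sample1_spots.h5ad",
    st_dir="spaceranger_out",
    recipe="gene",
)
source = resolve_st_source(ctor_kwargs["spot_adata_path"], ctor_kwargs["st_dir"])
print(source)
```

With thor installed, the same keyword arguments could be passed on as `thor.fineST(**ctor_kwargs)`.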

Methods

  • copy – Create a deep copy of the instance.

  • get_reduced_genes – Get a reduced set of genes which were used to train the VAE model.

  • load – Load a fineST instance from a JSON file.

  • load_generate_model

  • load_genes – Load the user-input genes to be used for prediction.

  • load_params – Load parameters from a JSON file.

  • load_result – Load the predicted gene expression data and create an anndata.AnnData object.

  • load_vae_model – Load a pre-trained VAE model.

  • predict_gene_expression – Predict gene expression using Markov graph diffusion.

  • prepare_input – Prepare the input for the fineST estimation.

  • prepare_recipe – Prepare the recipe for gene expression prediction.

  • sanity_check – Perform sanity checks on the input data and parameters.

  • save – Save the current state of the instance.

  • set_cell_features_csv_path – Set the path to the CSV file containing cell features.

  • set_cell_features_list – Set the list of cell features to be used for graph construction.

  • set_genes_for_prediction – Set genes to be used for prediction.

  • set_params – Set the parameters for the fineST estimation.

  • vae_training – Train a VAE model for the spot-level transcriptome data.

  • visualize_cell_network – Visualize the cell-cell network.

  • write_adata – Write an AnnData object to disk in the results directory.

  • write_params – Write the current parameters to a JSON file.

copy()

Create a deep copy of the instance.

Returns:

A new instance that is a deep copy of the current instance.

Return type:

fineST

get_reduced_genes(keep=0.9, min_mean_expression=0.5)

Get a reduced set of the genes that were used to train the VAE model. Not all genes used for VAE training are reconstructed equally faithfully, so only the genes with high reconstruction quality (measured by cosine similarity with the input gene expression) are kept. Note that the genes used for VAE training (self.adata.var.used_for_vae) are not the same as the genes used for thor prediction in reduced mode (self.adata.var.used_for_reduced), which are a subset of them.

Parameters:
  • keep (float, optional) – Fraction of genes to keep based on their importance in the VAE model for thor prediction in reduced mode. The genes are ranked according to the VAE reconstruction quality (measured by cosine similarity with the input gene expression). Default is 0.9.

  • min_mean_expression (float, optional) – Minimum mean expression for genes to be considered. Note the expression values are log-transformed normalized expression. Default is 0.5.

Returns:

The selected genes are marked in self.adata.var.used_for_reduced.

Return type:

None
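The selection described for get_reduced_genes can be sketched in plain Python: rank genes by the cosine similarity between observed and reconstructed expression, then keep the top fraction. This is a toy illustration rather than thor's implementation. The function names and the data are made up, and applying the mean-expression filter before the ranking is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_reduced_genes(observed, reconstructed, keep=0.9, min_mean_expression=0.5):
    """Rank genes by reconstruction quality and keep the top fraction.

    observed / reconstructed: dict mapping gene -> per-spot expression list.
    Assumption: the mean-expression filter runs before the ranking.
    """
    expressed = [
        gene for gene, values in observed.items()
        if sum(values) / len(values) >= min_mean_expression
    ]
    ranked = sorted(
        expressed,
        key=lambda gene: cosine_similarity(observed[gene], reconstructed[gene]),
        reverse=True,
    )
    return ranked[: int(len(ranked) * keep)]


# Made-up expression values for four genes across three spots.
observed = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 0.0, 2.0],
    "C": [0.1, 0.0, 0.2],   # mean below 0.5, filtered out
    "D": [1.0, 1.0, 1.0],
}
reconstructed = {
    "A": [1.1, 1.9, 3.2],   # close to the input
    "B": [0.0, 2.0, 0.0],   # orthogonal to the input
    "C": [0.1, 0.0, 0.2],
    "D": [1.0, 1.0, 0.5],
}
selected = select_reduced_genes(observed, reconstructed, keep=0.7)
print(selected)
```

Gene C is dropped by the expression filter, and gene B is dropped because its reconstruction ranks last among the three remaining genes.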

classmethod load(json_path)

Load a fineST instance from a JSON file.

Parameters:

json_path (str) – Path to the JSON file containing the saved instance.

Returns:

A new instance loaded from the JSON file.

Return type:

fineST

load_genes(genes_file_path)

Load the user-input genes to be used for prediction.

Parameters:

genes_file_path (str) – Path to the CSV file that contains the genes to be used for prediction. The genes should be in the first column of the file. The gene naming convention should match self.adata.var_names.

Returns:

The genes are loaded into the list self.genes.

Return type:

None
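A gene list file of the kind load_genes reads could be produced as in the sketch below. It assumes the file has no header row, matching the description of genes_path in the constructor. The gene symbols and file name are examples only, and the read-back step merely mimics what a first-column loader might do.

```python
import csv
import os
import tempfile

genes = ["CD3E", "MS4A1", "EPCAM"]  # example gene symbols

with tempfile.TemporaryDirectory() as tmp_dir:
    genes_file_path = os.path.join(tmp_dir, "genes.csv")

    # One gene per row, in the first column, with no header row.
    with open(genes_file_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for gene in genes:
            writer.writerow([gene])

    # Read the first column back, skipping any empty rows.
    with open(genes_file_path, newline="") as handle:
        loaded = [row[0] for row in csv.reader(handle) if row]

print(loaded)
```

With a fineST instance at hand, the path would then be passed as `load_genes(genes_file_path)` or as `genes_path` in the constructor.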

load_params(json_path)

Load parameters from a JSON file. The loaded parameters will be merged with the current parameters in self.run_params and self.graph_params.

Parameters:

json_path (str) – Path to the JSON file containing the parameters.

load_result(file_name, layer_name=None)

Load the predicted gene expression data and create an anndata.AnnData object.

Parameters:
  • file_name (str) – Relative path to the file to load. The file should be in self.save_dir.

  • layer_name (str or None, optional) – Name of the layer in which to store the loaded gene expression. If None, the gene expression is stored in adata.X.

Returns:

The loaded anndata.AnnData object.

Return type:

anndata.AnnData

load_vae_model(model_path=None)

Load a pre-trained VAE model.

Parameters:

model_path (str or None, optional) – Path to the directory containing the VAE model. The model should be saved in the encoder and decoder subdirectories, with the filenames {self.name}_VAE_encoder.h5 and {self.name}_VAE_decoder.h5.

Returns:

The VAE model is loaded into the instance as self.generate and the model path is stored in self.model_path.

Return type:

None
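The directory layout expected by load_vae_model can be spelled out with a small path helper. The sketch assumes that the encoder file sits in the encoder subdirectory and the decoder file in the decoder subdirectory. The helper `vae_model_files` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def vae_model_files(model_path, name):
    """Expected encoder and decoder file locations under model_path.

    Hypothetical helper for illustration; not part of the thor API.
    """
    root = PurePosixPath(model_path)
    return {
        "encoder": root / "encoder" / f"{name}_VAE_encoder.h5",
        "decoder": root / "decoder" / f"{name}_VAE_decoder.h5",
    }


files = vae_model_files("results/VAE_models", "sample1")
print(files["encoder"])
print(files["decoder"])
```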

predict_gene_expression(**kwargs)

Predict gene expression using Markov graph diffusion.

This method performs the following steps:

  1. Updates parameters using set_params().
  2. Prepares the recipe based on the selected mode.
  3. Optionally performs burn-in if transcriptome effects are included.
  4. Initializes the cell graph and transition matrix.
  5. Runs the Markov graph diffusion process.
  6. Saves results and cleans up temporary files.

Parameters:

**kwargs (dict) – Same parameters as accepted by set_params().

Returns:

Anndata object with predicted gene expression for cells.

Return type:

anndata.AnnData

See also

set_params()

For detailed parameter descriptions.

thor.markov_graph_diffusion.estimate_expression_markov_graph_diffusion()

For the underlying implementation.
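A typical run leading up to predict_gene_expression might be driven as in the sketch below. Only methods documented on this page are called. The order of the calls and the wrapper `run_prediction` are assumptions rather than a prescribed workflow.

```python
def run_prediction(sample, **params):
    """Drive a fineST-like object through input preparation and prediction.

    `sample` is expected to expose the methods documented on this page.
    The call order used here is an assumption, not a documented requirement.
    """
    sample.prepare_input(mapping_margin=10)
    sample.set_genes_for_prediction(genes_selection_key="highly_variable")
    if not sample.sanity_check():
        raise RuntimeError("sanity_check() reported a problem with the inputs.")
    # predict_gene_expression accepts the same keywords as set_params.
    return sample.predict_gene_expression(**params)


# With thor installed (placeholder paths):
#   import thor
#   sample = thor.fineST("sample_HE.tif", "sample1",
#                        spot_adata_path="sample1_spots.h5ad")
#   adata_cells = run_prediction(sample, n_iter=20, smoothing_scale=0.8)
```

Passing the parameters straight to predict_gene_expression avoids a separate set_params call, since the method updates the parameters itself as its first step.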

prepare_input(mapping_margin=10, spot_identifier='spot_barcodes')

Prepare the input for the fineST estimation.

First, generate the cell-wise adata from the cell features and the spot adata. In this step, the segmented cells are read from self.cell_features_csv_path, and segmentation outliers are removed according to the distance between a cell and its nearest neighbor. Second, the spot gene expression is mapped to the nearest aligned cells. Lastly, the spot heterogeneity is computed from the image features for the later construction of the cell-cell graph and the transition matrix.

Parameters:

mapping_margin (int or float, optional) – Margin for mapping the spot gene expression to the cells. Default is 10, which attempts to map cells that lie within 10 spot radii of any spot (so almost all identified cells are mapped to their nearest spots). Decrease this number to eliminate isolated cells.
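The role of mapping_margin can be illustrated with a toy nearest-spot assignment: a cell is mapped to its nearest spot only if that spot lies within mapping_margin spot radii. This is a sketch of the idea, not thor's implementation. The function, the coordinates, and the barcodes are made up.

```python
import math


def map_cells_to_spots(cells, spots, spot_radius, mapping_margin=10):
    """Assign each cell to its nearest spot if it lies within the margin.

    cells: dict cell_id -> (x, y); spots: dict barcode -> (x, y).
    Cells farther than mapping_margin * spot_radius from every spot are dropped.
    """
    max_distance = mapping_margin * spot_radius
    assignment = {}
    for cell_id, cell_xy in cells.items():
        barcode, distance = min(
            ((b, math.dist(cell_xy, spot_xy)) for b, spot_xy in spots.items()),
            key=lambda pair: pair[1],
        )
        if distance <= max_distance:
            assignment[cell_id] = barcode
    return assignment


spots = {"AAAC": (0.0, 0.0), "AAAG": (100.0, 0.0)}
cells = {"c1": (10.0, 5.0), "c2": (90.0, -5.0), "c3": (400.0, 400.0)}

mapped = map_cells_to_spots(cells, spots, spot_radius=25.0, mapping_margin=2)
print(mapped)
```

Here the isolated cell c3 is 500 units from its nearest spot, so it is dropped with a margin of 2 (50 units) and would only be kept with a margin of 20 or more.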

prepare_recipe()

Prepare the recipe for gene expression prediction.

This function sets up the appropriate genes and parameters based on the selected recipe. Supported recipes are:

  • "gene": use all the user-provided genes for prediction. The user-provided genes should be in self.adata.var.used_for_prediction.

  • "reduced": use the VAE genes for prediction, ignoring self.genes. The VAE genes should be in self.adata.var.used_for_vae.

  • "mix": use both the VAE genes and the rest of the user-provided genes for prediction. The VAE genes should be in self.adata.var.used_for_vae.

Returns:

The recipe-specific settings are applied to set self.adata.var columns for gene selection.

Return type:

None

sanity_check()

Perform sanity checks on the input data and parameters.

This function verifies that all necessary attributes and parameters are set correctly before running the fineST estimation.

Returns:

Returns True if all checks pass, otherwise False.

Return type:

bool

save(exclude=['generate', 'adata', 'conn_csr_matrix'])

Save the current state of the instance. The saved JSON file can be used to create a new instance of the class.

Parameters:

exclude (list of str, optional) – List of attributes to exclude from saving. Default is ["generate", "adata", "conn_csr_matrix"].

Returns:

The instance state is saved to a JSON file.

Return type:

None
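The save and load round trip, together with the role of **kwargs in the constructor, can be illustrated with a toy class. This is a sketch of the general pattern, not thor's code: the class `Sample`, the JSON file name, and returning the path from `save` are all assumptions made for the example.

```python
import json
import os
import tempfile


class Sample:
    """Toy stand-in for a class whose state round-trips through JSON."""

    def __init__(self, name, save_dir, **kwargs):
        self.name = name
        self.save_dir = save_dir
        self.adata = None  # stands in for a heavy, non-serializable attribute
        # Extra keywords become attributes, so saved state can be restored.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def save(self, exclude=("adata",)):
        """Write all attributes except the excluded ones to a JSON file."""
        state = {k: v for k, v in vars(self).items() if k not in exclude}
        json_path = os.path.join(self.save_dir, f"{self.name}.json")
        with open(json_path, "w") as handle:
            json.dump(state, handle, indent=2)
        return json_path

    @classmethod
    def load(cls, json_path):
        """Create a new instance from a previously saved JSON file."""
        with open(json_path) as handle:
            return cls(**json.load(handle))


with tempfile.TemporaryDirectory() as tmp_dir:
    original = Sample("sample1", tmp_dir, recipe="gene")
    restored = Sample.load(original.save())

print(restored.name, restored.recipe, restored.adata)
```

Excluded attributes such as the AnnData object are not written to the JSON file, so they have to be rebuilt or reloaded after load.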

set_cell_features_csv_path(cell_features_csv_path=None)

Set the path to the CSV file containing cell features.

Parameters:

cell_features_csv_path (str or None, optional) – Path to the CSV file containing cell features. If None, the cell features CSV file is generated from the whole slide image (WSI) by nuclei segmentation and feature extraction.

Returns:

The file path is stored in self.cell_features_csv_path.

Return type:

None
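Because the constructor expects the first two columns of the cell features CSV to be exactly the nuclei positions x and y, an externally produced file can be checked before it is handed over. The sketch below is illustrative: `check_feature_header` is a hypothetical helper, the feature values are made up, and the file is assumed to have no leading index column.

```python
import csv
import io

# Default feature names listed in this reference.
DEFAULT_FEATURES = [
    "x", "y", "mean_gray", "std_gray", "entropy_img",
    "mean_r", "mean_g", "mean_b", "std_r", "std_g", "std_b",
]


def check_feature_header(csv_text):
    """Return the header if its first two columns are exactly 'x' and 'y'."""
    header = next(csv.reader(io.StringIO(csv_text)))
    if header[:2] != ["x", "y"]:
        raise ValueError(f"First two columns must be 'x', 'y'; got {header[:2]}")
    return header


# A one-cell example file with made-up feature values.
csv_text = (
    ",".join(DEFAULT_FEATURES) + "\n"
    "120.5,88.0,0.61,0.12,4.2,0.70,0.52,0.66,0.10,0.09,0.11\n"
)
header = check_feature_header(csv_text)
print(header[:2])
```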

set_cell_features_list(cell_features_list=None)

Set the list of cell features to be used for graph construction.

Parameters:

cell_features_list (list or None, optional) – List of features to be used for generating the cell-cell graph. If None, default features will be used: ["x", "y", "mean_gray", "std_gray", "entropy_img", "mean_r", "mean_g", "mean_b", "std_r", "std_g", "std_b"].

Returns:

The feature names are stored in the list self.cell_features_list.

Return type:

None

set_genes_for_prediction(genes_selection_key='highly_variable')

Set genes to be used for prediction.

Parameters:

genes_selection_key (str, optional) –

Key for gene selection in self.adata.var. Default: "highly_variable"

Valid options:

  • "highly_variable": Selects highly variable genes

  • "all": Selects all genes (not recommended)

  • None: Uses genes specified in self.genes

  • Any key in self.adata.var: Uses that key for selection

Returns:

The selected genes are marked in self.adata.var.used_for_prediction.

Return type:

None
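The four options for genes_selection_key can be mimicked on a plain dictionary that stands in for self.adata.var. This is a sketch under assumptions, not thor's code: the function name is hypothetical, and restricting user-supplied genes to those present in the table is an assumption.

```python
def genes_for_prediction(var, genes_selection_key="highly_variable", genes=None):
    """Return the genes selected by the given key.

    var: dict mapping gene -> dict of boolean annotation columns.
    genes: user-supplied gene list, used only when the key is None.
    """
    if genes_selection_key is None:
        wanted = set(genes or [])
        # Assumption: genes missing from var are silently skipped.
        return [gene for gene in var if gene in wanted]
    if genes_selection_key == "all":
        return list(var)
    # "highly_variable" or any other boolean column in var.
    return [
        gene for gene, columns in var.items()
        if columns.get(genes_selection_key, False)
    ]


# Made-up annotation table with two boolean columns.
var = {
    "CD3E": {"highly_variable": True, "marker": True},
    "ACTB": {"highly_variable": False, "marker": False},
    "MS4A1": {"highly_variable": True, "marker": False},
}
print(genes_for_prediction(var))
print(genes_for_prediction(var, "marker"))
print(genes_for_prediction(var, None, genes=["ACTB", "XYZ"]))
```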

set_params(**kwargs)

Set the parameters for the fineST estimation.

This method allows you to configure both graph construction and Markov diffusion parameters. All parameters are optional and will update the default values.

Graph Construction Parameters

n_neighbors : int, optional

Number of neighbors for cell-cell graph construction. Default is 5.

obs_keys : list or None, optional

List of observation keys to use for graph construction. If None, uses default features.

reduced_dimension_transcriptome_obsm_key : str, optional

Key in obsm for reduced dimension representation. Default is "X_pca".

reduced_dimension_transcriptome_obsm_dims : int, optional

Number of dimensions to use from the reduced representation. Default is 2.

geom_morph_ratio : float, optional

Ratio between geometric and morphological distances. Default is 1.

geom_constraint : float, optional

Constraint on geometric distances. Default is 0.

snn_threshold : float, optional

Threshold for shared nearest neighbor graph. Default is 0.1.

node_features_obs_list : list, optional

List of node features to use. Default is ["spot_heterogeneity"].

balance_cell_quality : bool, optional

Whether to balance cell quality. Default is False.

bcq_IQR : tuple, optional

Interquartile range for balancing cell quality. Default is (0.15, 0.85).

Transition Matrix Parameters

preferential_flow : bool, optional

Whether to use preferential flow in the transition matrix. Default is True.

weigh_cells : bool, optional

Whether to weigh cells (nodes) by node features. The idea is to give more weight to the cells with higher quality (e.g. lower heterogeneity with surrounding cells). Default is True.

smoothing_scale : float, optional

Scale for smoothing when constructing the transition matrix. The lower the scale, the more self-transition (i.e. the more likely to stay in the same state). Default is 0.8.

The diffusion transition matrix is defined as:

\[T = I - \lambda \cdot K\]

where \(I\) is the identity matrix, \(K\) is the connectivity matrix, and \(\lambda\) is the smoothing scale.

inflation_percentage : float or None, optional

Percentage for the reverse diffusion scale relative to the forward diffusion scale smoothing_scale. Default is None (no reverse diffusion). When enabled, the reverse diffusion is performed right after the forward diffusion in every iteration.

The reverse diffusion transition matrix is defined as:

\[T_{rev} = I - \mu \cdot K\]

where \(I\) is the identity matrix, \(K\) is the connectivity matrix, and \(\mu\) is the reverse diffusion scale. \(\mu = -\lambda \cdot (1 + \text{inflation_percentage} \cdot 100)\)

conn_csr_matrix : scipy.sparse.csr_matrix or None, optional

Pre-computed connectivity matrix. Default is None.

  • If provided, the connectivity matrix will be used and the other parameters will be ignored.

  • If None, the connectivity matrix will be computed from the cell-cell graph if it does not exist in adata.obsp; else the existing connectivity matrix will be used.

  • If “force”, the connectivity matrix will be computed regardless of whether it already exists.
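The forward and reverse transition matrices defined for smoothing_scale and inflation_percentage can be worked through on a tiny example. The sketch below applies the two formulas exactly as they are written in this reference. The 2x2 matrix K and the value of inflation_percentage are made up for illustration, and how thor actually builds K is not shown here.

```python
def identity_minus(scale, K):
    """Return I - scale * K for a square matrix K given as nested lists."""
    n = len(K)
    return [
        [(1.0 if i == j else 0.0) - scale * K[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy 2x2 matrix standing in for K (an assumption for illustration).
K = [[0.5, -0.5],
     [-0.5, 0.5]]

smoothing_scale = 0.8         # lambda, the documented default
inflation_percentage = 0.01   # arbitrary illustrative value

# Forward diffusion: T = I - lambda * K
T = identity_minus(smoothing_scale, K)

# Reverse scale, following the formula as written in this reference:
# mu = -lambda * (1 + inflation_percentage * 100)
mu = -smoothing_scale * (1 + inflation_percentage * 100)

# Reverse diffusion: T_rev = I - mu * K
T_rev = identity_minus(mu, K)

print(T)
print(mu)
print(T_rev)
```

Because mu is negative, the reverse step adds a multiple of K back instead of subtracting it, which counteracts part of the smoothing from the forward step.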

Markov Diffusion Parameters

initialize : bool, optional

Whether to initialize graph and transition matrix. Default is True.

burn_in_steps : int, optional

Number of steps for burn-in period. Default is 5.

layer : str or None, optional

Layer in AnnData to use for gene expression. Default is None (use .X).

is_rawCount : bool, optional

Whether the input data is raw counts. Default is False.

regulate_expression_mean : bool, optional

Whether to regulate expression mean. Default is False.

stochastic_expression_neighbors_level : str, optional

Level for stochastic expression neighbors ("spot" or "cell"). Default is "spot".

n_iter : int, optional

Number of iterations for Markov diffusion. Default is 20.

conn_key : str, optional

Key for connectivity matrix in .obsp. Default is "snn".

write_freq : int, optional

Frequency of writing results to disk. Default is 10.

out_prefix : str, optional

Prefix for output files. Default is "y".

sample_predicted_expression_fluctuation_scale : float, optional

Scale for fluctuation in predicted expression. Default is 1.

smooth_predicted_expression_steps : int, optional

Number of steps for smoothing predicted expression. Default is 0.

save_chain : bool, optional

Whether to save the MCMC chain. Default is False.

n_jobs : int, optional

Number of parallel jobs to run. Default is 1.

adjust_cell_network_by_transcriptome_scale : float, optional

Scale for adjusting cell network by transcriptome. Default is 0.

See also

predict_gene_expression()

For using these parameters in prediction.

thor.markov_graph_diffusion.markov_graph_diffusion_initialize()

For graph construction details.

thor.markov_graph_diffusion.estimate_expression_markov_graph_diffusion()

For Markov diffusion details.

vae_training(vae_genes_set=None, min_mean_expression=0.1, **kwargs)

Train a VAE model for the spot-level transcriptome data.

Parameters:
  • vae_genes_set (set or None, optional) –

    Set of genes to be used for VAE training.

    If None, all the genes (adata.var.used_for_prediction, which are specified in prepare_input()) with mean expression > min_mean_expression will be used.

  • min_mean_expression (float, optional) – Minimum mean expression for the genes to be used for VAE training.

  • kwargs (dict) – Keyword arguments for the thor.VAE.train_vae() function.

Return type:

None

See also

thor.VAE.train_vae()

For detailed parameter descriptions and usage.

visualize_cell_network(**kwargs)

Visualize the cell-cell network.

This function internally calls thor.plotting.graph.plot_cell_graph() to create the visualization.

Parameters:

**kwargs (dict) – Additional parameters for the visualization. See thor.plotting.graph.plot_cell_graph() for available options.

Returns:

The network visualization is displayed.

Return type:

None

See also

thor.plotting.graph.plot_cell_graph()

For detailed parameter descriptions and usage.

write_adata(file_name, ad)

Write an AnnData object to disk in the results directory.

Parameters:
  • file_name (str) – Relative path to the file to write. The file will be saved in self.save_dir.

  • ad (anndata.AnnData) – Cell-wise gene expression to save.

write_params(exclude=['conn_csr_matrix'])

Write the current parameters to a JSON file.

Parameters:

exclude (list of str, optional) – List of parameter names to exclude from writing. Default is ["conn_csr_matrix"].

Returns:

The parameters are written to a JSON file in self.save_dir.

Return type:

None