In this module, I present the pipeline for extracting features from the NF1 data.
To extract these features, I use DeepProfiler at commit 8752f69.
Based on previous projects in the lab, I decided to use a pretrained model from the LUAD Cell Painting repository with DeepProfiler.
DeepProfiler can also train and create your own model (also called a checkpoint), which I would like to test in the future.
The config files are based on the one used in the LUAD Cell Painting repository mentioned above. The following changes were made to the two config files, NF1_nuc_config.json and NF1_cyto_config.json.
Both:
"Allele" -> "Genotype"
In the LUAD study, alleles were compared across Cell Painting images. For the NF1 data, the genotypes of the NF1 gene are compared.
dataset: images: {file format: tif, bits: 16, width: 1080, height: 1080} -> dataset: images: {file format: tiff, bits: 8, width: 1224, height: 904}
The image details are changed to reflect the NF1 data.
prepare: implement: true -> prepare: implement: false
We do not prepare the NF1 data with illumination correction (already done) or compression with DeepProfiler.
dataset: images: channels: [DNA, ER, RNA, AGP, Mito] -> dataset: images: channels: [DNA, ER, RNA]
While the Cell Painting dataset has multiple channels for cell images, the NF1 data only has the first three channels to examine.
NF1_nuc_config.json:
dataset: locations: box_size: 96 -> dataset: locations: box_size: 128
This change expands the size of the box placed around each nucleus that DeepProfiler interprets. This expansion was recommended by Juan Caicedo to improve performance.
NF1_cyto_config.json:
dataset: locations: box_size: 96 -> dataset: locations: box_size: 256
This change expands the size of the box around each cell that DeepProfiler interprets. This expansion attempts to capture as much of the cytoplasm as possible (this will be benchmarked in the future to assess the best box size).
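The config edits listed above can also be applied programmatically. The following is a minimal sketch, not the method used in this repository: it works on a small in-memory dictionary, and the exact key names (for example file_format and label_field) and nesting are assumptions based on the descriptions above, so check them against the actual config file before reusing it.

```python
import copy


def set_nested(config, keys, value):
    """Set config[keys[0]][keys[1]]... = value, creating dicts as needed."""
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def replace_strings(obj, old, new):
    """Recursively replace values equal to the string `old` with `new`."""
    if isinstance(obj, dict):
        return {k: replace_strings(v, old, new) for k, v in obj.items()}
    if isinstance(obj, list):
        return [replace_strings(v, old, new) for v in obj]
    return new if obj == old else obj


def make_nf1_config(base_config, box_size):
    """Return a modified copy of the base config; the input is not changed."""
    config = replace_strings(copy.deepcopy(base_config), "Allele", "Genotype")
    # Key names below are assumed, not read from a real DeepProfiler config.
    set_nested(config, ["dataset", "images", "file_format"], "tiff")
    set_nested(config, ["dataset", "images", "bits"], 8)
    set_nested(config, ["dataset", "images", "width"], 1224)
    set_nested(config, ["dataset", "images", "height"], 904)
    set_nested(config, ["dataset", "images", "channels"], ["DNA", "ER", "RNA"])
    set_nested(config, ["dataset", "locations", "box_size"], box_size)
    set_nested(config, ["prepare", "implement"], False)
    return config


# Hypothetical stand-in for the LUAD config, reduced to the fields discussed.
base = {
    "dataset": {
        "metadata": {"label_field": "Allele"},
        "images": {
            "file_format": "tif",
            "bits": 16,
            "width": 1080,
            "height": 1080,
            "channels": ["DNA", "ER", "RNA", "AGP", "Mito"],
        },
        "locations": {"box_size": 96},
    },
    "prepare": {"implement": True},
}

nuc_config = make_nf1_config(base, box_size=128)
cyto_config = make_nf1_config(base, box_size=256)
print(nuc_config["dataset"]["locations"], cyto_config["dataset"]["locations"])
```

Keeping the shared edits in one function and passing only box_size makes the single difference between the nuclei and cytoplasm configs explicit.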
# Run this command in terminal to create the conda environment for feature extraction
conda env create -f 3.NF1_feature_extraction_env.yml
# Run this command in terminal to activate the conda environment for DeepProfiler feature extraction
conda activate 3.feature-extraction-NF1
Clone the DeepProfiler repository into 3_extracting_features/ with:
# Make sure you are located in 3_extracting_features/
cd 3_extracting_features/
git clone https://github.com/cytomining/DeepProfiler.git
Install the DeepProfiler repository with:
# Make sure you are located in DeepProfiler/ to install
cd DeepProfiler/
pip install -e .
Based on previous projects within the lab, we found that using TensorFlow GPU with DeepProfiler improves performance. To set up TensorFlow GPU, follow these instructions. I use TensorFlow GPU while processing the NF1 data.
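Before running DeepProfiler, it can help to confirm that TensorFlow actually sees a GPU. The following is a small sketch of such a check, not part of this repository; the function name is hypothetical, and it returns None when TensorFlow is not installed in the active environment.

```python
import importlib.util


def tensorflow_gpu_devices():
    """Return the names of GPUs visible to TensorFlow, or None if it is absent."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = tensorflow_gpu_devices()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow is installed but no GPU is visible")
else:
    print("GPUs visible to TensorFlow:", gpus)
```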
Inside the notebook compile_DP_projects.ipynb, the variables nuc_project_path and cyto_project_path need to be changed to reflect the desired DeepProfiler project locations.
In order to profile features with DeepProfiler, a project needs to be set up with a certain file structure and files.
In compile_DP_projects.ipynb, the necessary project structure is created using the functions from DPutils.py.
The config files (NF1_nuc_config.json/NF1_cyto_config.json) are copied to their corresponding projects, and the pretrained model (efficientnet-b0_weights_tf_dim_ordering_tf_kernels_autoaugment.h5) is copied to both projects. These files are located within the DP_files folder for reference.
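As an illustration of what setting up a project involves, here is a minimal sketch using only the standard library. It is not the code in DPutils.py. The folder layout (inputs/config, inputs/images, inputs/locations, inputs/metadata, and outputs/<experiment>/checkpoint) is an assumption about DeepProfiler's expected structure, and create_dp_project is a hypothetical name; consult the DeepProfiler wiki for the authoritative layout.

```python
import pathlib
import shutil
import tempfile


def create_dp_project(project_path, config_file, checkpoint_file,
                      experiment="efn_pretrained"):
    """Create an assumed DeepProfiler folder layout and copy the two files in."""
    project_path = pathlib.Path(project_path)
    config_dir = project_path / "inputs" / "config"
    checkpoint_dir = project_path / "outputs" / experiment / "checkpoint"
    folders = [
        config_dir,
        project_path / "inputs" / "images",
        project_path / "inputs" / "locations",
        project_path / "inputs" / "metadata",
        checkpoint_dir,
    ]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    shutil.copy(config_file, config_dir)
    shutil.copy(checkpoint_file, checkpoint_dir)
    return project_path


# Demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    config = tmp / "NF1_nuc_config.json"
    config.write_text("{}")
    weights = tmp / "weights.h5"  # placeholder for the pretrained model file
    weights.write_bytes(b"")
    project = create_dp_project(tmp / "NF1_nuc_project", config, weights)
    created = sorted(
        p.relative_to(project).as_posix()
        for p in project.rglob("*")
        if p.is_file()
    )

print(created)
```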
We need to compile an index.csv file, as it is necessary for DeepProfiler to load each image.
We create this using the annotations file.
Using the index.csv, we compile the locations (in project/inputs/locations), which are the CSV files DeepProfiler needs to find the single cells in each image.
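To make the relationship between the annotations, index.csv, and the location files concrete, here is a hedged sketch. The annotation fields, the image file naming pattern, and the index column names (Metadata_Plate, Metadata_Well, Metadata_Site, plus one column per channel) are assumptions for illustration and are not taken from this repository's notebook.

```python
import csv
import io

CHANNELS = ["DNA", "ER", "RNA"]


def build_index_rows(annotations):
    """Turn one annotation record per site into one index row per site."""
    rows = []
    for record in annotations:
        row = {
            "Metadata_Plate": record["plate"],
            "Metadata_Well": record["well"],
            "Metadata_Site": record["site"],
            "Genotype": record["genotype"],
        }
        for channel in CHANNELS:
            # Hypothetical image naming pattern, relative to inputs/images.
            row[channel] = (
                f'{record["plate"]}/{record["well"]}_{record["site"]}_{channel}.tiff'
            )
        rows.append(row)
    return rows


def location_file_name(row, suffix="Nuclei"):
    """Relative path of the location file for one index row."""
    return (
        f'{row["Metadata_Plate"]}/'
        f'{row["Metadata_Well"]}-{row["Metadata_Site"]}-{suffix}.csv'
    )


def write_index(rows, handle):
    """Write the index rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


# Made-up annotation records for illustration only.
annotations = [
    {"plate": "1", "well": "C6", "site": "1", "genotype": "WT"},
    {"plate": "1", "well": "D6", "site": "2", "genotype": "Null"},
]
rows = build_index_rows(annotations)
buffer = io.StringIO()
write_index(rows, buffer)
index_text = buffer.getvalue()
location_files = [location_file_name(row) for row in rows]
print(index_text)
print(location_files)
```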
For more information on DeepProfiler, please reference the DeepProfiler wiki.
# Run this script in terminal to compile the DeepProfiler projects
bash 3.compile-DP-projects.sh
Change path/to/DP_nuc_project and path/to/DP_cyto_project below to the nuc_project_path and cyto_project_path set in step 3.
Note: Only include what is inside pathlib.Path(), not the full path, for each variable (e.g., pathlib.Path('NF1_nuc_project') -> use NF1_nuc_project).
# Run these commands in terminal to extract features with DeepProfiler
# Replace path/to/DP_nuc_project and path/to/DP_cyto_project with your project paths
python3 -m deepprofiler --gpu 0 --exp efn_pretrained --root path/to/DP_nuc_project --config NF1_nuc_config.json profile
python3 -m deepprofiler --gpu 0 --exp efn_pretrained --root path/to/DP_cyto_project --config NF1_cyto_config.json profile
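If you prefer to generate these commands from the same pathlib.Path values used in the notebook, the sketch below assembles the command strings without running them. The helper name profile_command is hypothetical; the flags are the ones shown in the commands above.

```python
import pathlib
import shlex


def profile_command(project_path, config_name, gpu=0, experiment="efn_pretrained"):
    """Build the DeepProfiler profile command for one project as a string."""
    parts = [
        "python3", "-m", "deepprofiler",
        "--gpu", str(gpu),
        "--exp", experiment,
        "--root", str(pathlib.Path(project_path)),
        "--config", config_name,
        "profile",
    ]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(parts)


nuc_command = profile_command(pathlib.Path("NF1_nuc_project"), "NF1_nuc_config.json")
cyto_command = profile_command(pathlib.Path("NF1_cyto_project"), "NF1_cyto_config.json")
print(nuc_command)
print(cyto_command)
```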
Inside the notebook rename_cyto_locations.ipynb, the variable cyto_locations_path needs to be changed to reflect the plate in the /locations directory from the Cytoplasm project that contains the files to be renamed.
Note: Currently, there is only one plate in this pilot data, which is why the path goes directly to one plate and not to the whole /locations directory.
Due to the format of the checkpoint being used, the location files within the Cytoplasm project must end with Nuclei.csv.
This means that the files in the /inputs/locations directory for both the Nuclei and Cytoplasm projects have exactly the same names (e.g., {well}-{site}-Nuclei.csv).
With the rename_cyto_locations.ipynb notebook, the suffix of the files in the Cytoplasm project is renamed to Cytoplasm.csv to avoid confusion during downstream analysis.
# Run this script in terminal to rename all files in the /inputs/locations/ directory for the Cytoplasm Project
bash 3.rename_cyto_locations.sh
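The renaming step itself is simple to express with pathlib. The following is a minimal sketch of the idea, not the notebook's actual code; the function name and the example file names are hypothetical, and it should only be run after feature extraction, since the checkpoint expects the Nuclei.csv suffix during profiling.

```python
import pathlib
import tempfile


def rename_cyto_locations(cyto_locations_path,
                          old_suffix="Nuclei.csv",
                          new_suffix="Cytoplasm.csv"):
    """Rename every *Nuclei.csv file in one plate directory to *Cytoplasm.csv."""
    renamed = []
    directory = pathlib.Path(cyto_locations_path)
    # sorted() collects the matches before any file is renamed.
    for path in sorted(directory.glob(f"*{old_suffix}")):
        new_path = path.with_name(path.name[: -len(old_suffix)] + new_suffix)
        path.rename(new_path)
        renamed.append(new_path.name)
    return renamed


# Demonstration on empty placeholder files in a temporary plate directory.
with tempfile.TemporaryDirectory() as tmp:
    plate_dir = pathlib.Path(tmp)
    for name in ("C6-1-Nuclei.csv", "C6-2-Nuclei.csv"):
        (plate_dir / name).touch()
    renamed_files = rename_cyto_locations(plate_dir)
    remaining = sorted(p.name for p in plate_dir.iterdir())

print(renamed_files)
```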