Hi Alex,
Thank you for getting back to me with more details.
I tried your approach and I can reproduce your issue. The trouble comes from the fact that you probably traced the model with a torch version >= 1.10, which is what the requirements of the nnUnet package specify.
The ImFusionSuite, however, links against torch 1.8, so the actual problem is that the older torch version we use cannot correctly run a model traced with a newer one (quite understandably).
The good news is that downgrading to torch 1.8 doesn’t break nnUnet, or at least not your export script, and with this version I was still able to trace the model with your script. You can downgrade to torch 1.8 with the command
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
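After the downgrade, re-tracing works exactly as in your export script; just as a rough sketch of the idea (the stand-in network, file name and input shape below are placeholders, your script already does the real thing):

import torch
import torch.nn as nn

print(torch.__version__)  # should report 1.8.1 after the downgrade

# Stand-in network just to keep the snippet self-contained; in your case this is
# the nnUnet model built and loaded exactly as in your export script
model = nn.Conv3d(1, 2, kernel_size=3, padding=1)
model.eval()

dummy_input = torch.randn(1, 1, 64, 64, 64)  # placeholder shape, use your patch size
with torch.no_grad():
    traced = torch.jit.trace(model, dummy_input)
traced.save("Kidney_traced_model_8.pt")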
This solves your issue, and I’m able to run the model in the Suite.
To do so, however, I needed to adapt the configuration file in a number of ways, which I list below:
- The Kits19 dataset your model is trained on has a non-trivial patient orientation matrix you need to apply, otherwise the resulting torch tensor would look rotated w.r.t. its world coordinate system. You can do that by adding BakeTransformation to the list of preprocessing operations.
- Secondly, the network expects the data to be on an equally spaced grid. Thus, I resampled the input to an isotropic resolution of 1 mm, while the original data resolution is anisotropic along z (0.9 X, 0.9 Y, 0.5 Z); see the small resampling sketch right after this list.
- Even after resampling the data is quite large, and the network did not fit in the 4 GB of VRAM of my laptop. You can subdivide your input image into patches of a given size by adding a Sampling section to your configuration, see the example.
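Just to illustrate what the Resample step does conceptually (a rough sketch outside of the Suite, with the spacings hard-coded to the values above and a dummy volume):

import torch
import torch.nn.functional as F

spacing = (0.5, 0.9, 0.9)          # original voxel spacing in mm, in (Z, Y, X) order
target_spacing = (1.0, 1.0, 1.0)   # isotropic 1 mm

volume = torch.randn(1, 1, 100, 256, 256)  # dummy volume in (N, C, D, H, W) layout

# new grid size = old size * old spacing / target spacing, per axis
new_size = [int(round(s * sp / tsp))
            for s, sp, tsp in zip(volume.shape[2:], spacing, target_spacing)]
resampled = F.interpolate(volume, size=new_size, mode="trilinear", align_corners=False)
print(resampled.shape)  # roughly (1, 1, 50, 230, 230)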
The trickiest part in your case was that the network actually outputs a tuple of 5 tensors as predictions (you can check that by opening the traced model with netron, for instance). Our default configuration expects a single-input single-output model; in your case, however, you have a single-input multiple-output model. We support that too, but the documentation for this is not there yet, so please have a look at the config file I put together and the comments therein.
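If you want to double-check the number of output heads without installing netron, a quick way is to run the traced model on a dummy input (file name and input shape are placeholders):

import torch

traced = torch.jit.load("Kidney_traced_model_8.pt")
traced.eval()

dummy_input = torch.randn(1, 1, 64, 64, 64)  # use a shape your model accepts
with torch.no_grad():
    outputs = traced(dummy_input)

print(type(outputs), len(outputs))  # expect a tuple of 5 tensors
for i, out in enumerate(outputs):
    print("Prediction" + str(i), tuple(out.shape))

And here is the configuration file: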
Version: 4.0
Type: NeuralNetwork
Name: KidneySegmentationModel
Description: Segmentation of kidneys
Engine: torch # Could also be onnx
ModelFile: Kidney_traced_model_8.pt # Path to the actual model file (could be an onnx file)
ForceCPU: false # Set it to true if you want to perform the inference on the CPU instead of the GPU
Verbose: false # Print many info messages
MaxBatchSize: 1 # Maximum number of images to run through the network simultaneously
# The following section specifies the list of operations applied to the input *before* it is fed to the model
PreProcessing:
- BakeTransformation # applies the affine transformation of the input such that the resulting transformation is the identity
- Resample: # Resamples the input to the desired target resolution; if a scalar is given, isotropic resolution is assumed
    resolution: 1.0
# In the multiple-output case, each output head type must be specified, this is mostly used for the UI
PredictionOutput: [Image, Image, Image, Image, Image]
# In the multiple-output case, one must also specify the names of the output heads
# as specified by the model. When using a torch model returning a tuple of (unnamed) tensors,
# this has the unfortunate consequence that our software automatically "names"
# each entry in the output tuple Prediction0, Prediction1, ... . This is an implementation detail that the user
# unfortunately has to know to set up the config (but shouldn't need to); we are working on a solution for this.
EngineOutputFields: [Prediction0, Prediction1, Prediction2, Prediction3, Prediction4]
# In case of multiple outputs, the label names must be nested under the corresponding output field name from above.
# I did it only for the output we are interested in.
LabelNames:
  Prediction0: [Kidney, Tumor] # Names of the different labels encoded as channels of the output tensor, used by the UI
PostProcessing:
# Operations support an `apply_to` field which can be used to manually select which elements each operation
# is applied to. In the `Remove` op below, all the elements are removed apart from `Prediction0`
- Remove: # Remove the deep supervision heads, which are not very interesting at inference
    apply_to: [Prediction1, Prediction2, Prediction3, Prediction4]
- Rename:
    source: [Prediction0]
    target: [Prediction]
- ArgMax: {}
- KeepLargestComponent # Keep the largest component of each individual label
# This section specifies the sampling strategy at runtime
Sampling:
- DimensionDivisor: 64 # Pads the image to the next multiple of this number; it makes sure that the UNet downsampling and upsampling paths produce images of the same size
- MaxSizeSubdivision: 128 # Splits the image into the smallest number of patches of size `MaxSizeSubdivision`
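For intuition, the ArgMax and KeepLargestComponent steps do roughly the following (a sketch with numpy/scipy outside of the Suite, not our actual implementation; the 3 channels are just an example):

import numpy as np
from scipy import ndimage

prediction = np.random.rand(3, 50, 64, 64)  # per-class scores, e.g. background / Kidney / Tumor

# ArgMax: collapse the channel dimension into a single label map
labelmap = np.argmax(prediction, axis=0)

# KeepLargestComponent: for each foreground label, keep only its largest connected component
cleaned = np.zeros_like(labelmap)
for label in np.unique(labelmap):
    if label == 0:
        continue
    components, n = ndimage.label(labelmap == label)
    if n == 0:
        continue
    sizes = np.bincount(components.ravel())[1:]  # voxel count of each component
    largest = int(np.argmax(sizes)) + 1
    cleaned[components == largest] = label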
This is a screenshot of the results I get on case 0000 of the Kits19 dataset:
Please note the label names (Kidney, Tumor) in the Display Options of the 3D view.
You can download the traced model, the configuration yaml and the input image at this link.
Let me know if that works for you.
Best,
Mattia