Local.ai on Unraid Not Using Nvidia GPUs

Tuesday, September 17th 2024

I am on an adventure to get Local.ai working on my server.  I've been using Ollama and Open WebUI, but I wanted to try something different that has a little more to offer.  I initially tried Biniou, and while it has a great collection of models and features, I feel it's still a bit immature and needs a lot of work before it becomes a stable, usable AI tool.

I decided that I would go with Local.ai, but I immediately ran into some issues with how it was working.  I built the container using the standard Unraid docker template, but after doing some research, I found that the container wasn't using my Nvidia cards.

The image I was using was localai/localai:latest-gpu-nvidia-cuda-12.  Additionally, I made sure that I had the --gpus=all parameter set when creating the container.
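
Outside of the Unraid template, the command that gets generated boils down to something roughly like this (the paths and names are from my setup, so treat it as a sketch rather than the exact command Unraid runs):

# --gpus=all is the piece that passes the Nvidia cards into the container
docker run -d \
  --name=LocalAI \
  --gpus=all \
  -p 8080:8080 \
  -v /mnt/user/ai/models:/build/models \
  -e TZ=America/Chicago \
  localai/localai:latest-gpu-nvidia-cuda-12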

I checked GPU usage during my queries with the nvidia-smi command, both in the container and on the server console (the exact commands are sketched after the log excerpt below).  After reviewing the logs, I found that llama.cpp was failing to load the models:

12:20PM INF Trying to load the model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with the backend '[llama-cpp llama-ggml llama-cpp-fallback rwkv stablediffusion piper whisper huggingface bert-embeddings /build/backend/python/autogptq/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/coqui/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/mamba/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/transformers/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/vllm/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/bark/run.sh /build/backend/python/exllama/run.sh]'
12:20PM INF [llama-cpp] Attempting to load
12:20PM INF Loading model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with backend llama-cpp
12:20PM INF [llama-cpp] attempting to load with AVX variant
12:20PM INF [llama-cpp] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:20PM INF [llama-cpp] Autodetection failed, trying the fallback
12:20PM INF Loading model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with backend llama-cpp-avx
12:20PM INF [llama-cpp] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:20PM INF [llama-ggml] Attempting to load
12:20PM INF Loading model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with backend llama-ggml
12:20PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:20PM INF [llama-cpp-fallback] Attempting to load
12:20PM INF Loading model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with backend llama-cpp-fallback
12:20PM INF [llama-cpp-fallback] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:20PM INF [rwkv] Attempting to load
12:20PM INF Loading model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with backend rwkv
12:20PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:20PM INF [stablediffusion] Attempting to load
12:20PM INF Loading model 'Bunny-Llama-3-8B-Q4_K_M.gguf' with backend stablediffusion
12:20PM INF [stablediffusion] Loads OK
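
For reference, the GPU checks mentioned above amounted to something like this (the container name matches my setup):

# on the Unraid console: shows driver version, CUDA version, and any processes using the GPUs
nvidia-smi

# inside the running container: confirms the container can see the cards at all
docker exec -it LocalAI nvidia-smi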

I initially thought that it could be a CUDA support issue, but after checking Nvidia's documentation, both the RTX A2000 and the RTX A4000 are supported by CUDA 12.
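
If you want to double-check your own cards, newer Nvidia drivers can report the compute capability directly (I believe this query field requires a relatively recent driver):

# lists each GPU's name and its CUDA compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv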

After some research, I thought the issue could be that AVX2 support isn't available in my processor.  So, I tried building a custom compose file and disabling the AVX2 flag during the build process (a quick way to check what the CPU actually reports is sketched after the compose file below).

services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    build:
      args:
        - AVX2=false
    container_name: LocalAI
    restart: unless-stopped
    network_mode: bridge
    environment:
      - TZ=America/Chicago
      - HOST_OS=Unraid
      - HOST_HOSTNAME=Zealot
      - HOST_CONTAINERNAME=LocalAI
      - DEBUG=false
    labels:
      - "com.unraid.docker.managed=dockerman"
      - "com.unraid.docker.webui=http://[IP]:[PORT:8080]/"
      - "com.unraid.docker.icon=https://github.com/go-skynet/LocalAI/assets/2420543/0966aa2a-166e-4f99-a3e5-6c915fc997dd?raw=1"
    ports:
      - "8080:8080/tcp"
    volumes:
      - /mnt/user/ai/models:/build/models:rw
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
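
As for checking what the CPU actually reports, this one-liner on the Unraid console lists the AVX-related flags (no avx2 in the output would confirm the suspicion):

# print the unique avx* flags the CPU advertises
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u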

Unfortunately, this didn't solve the issue either; AVX support (without AVX2) should still be sufficient to run models anyway.  Plus, I had already been running these same models on Ollama.

After trying to figure this one out, I decided to go with the latest CPU container and manually add the --gpus=all parameter, in hopes that the container would build with CPU support but still find the GPU cards and use them.  Maybe something was wrong with the CUDA/GPU containers and this was just a small loophole.  (This was also fruitless.)
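
That attempt was just a tag swap with the same GPU flag (as far as I can tell, the plain latest tag is the CPU-only build):

# CPU-only image, but still passing the GPUs through on the off chance it uses them
docker run -d \
  --name=LocalAI \
  --gpus=all \
  -p 8080:8080 \
  -v /mnt/user/ai/models:/build/models \
  localai/localai:latest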

After reading the documentation on Local.ai's GPU Acceleration page, I found that for CUDA support they recommend using the cublas containers instead.  The default containers linked in Unraid aren't the same ones recommended by Local.ai, so the easy solution was to change the repository reference and rebuild the container from the new image.  I tried using localai/localai:master-cublas-cuda12 instead.  Unfortunately, this didn't resolve the issue either.
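
For anyone following along, a generic sanity check worth running against the cublas container (this isn't from the Local.ai docs, just a standard Linux check):

# list any cuBLAS/CUDA runtime libraries the dynamic linker knows about (empty output is a red flag)
docker exec -it LocalAI sh -c "ldconfig -p | grep -i 'cublas\|cudart'"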

So, I am back to thinking that this problem has to do with AVX support.  I am going to research more, and try rebuilding the containers from scratch.  Hopefully that will yield some better results.
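
If I go the from-scratch route, the rough shape of it would be something like the following (based on my reading of the LocalAI build docs; the exact build args are an assumption on my part, so double-check them against the repo):

# clone the repo and build a CUDA 12 / cuBLAS image locally
git clone https://github.com/mudler/LocalAI.git
cd LocalAI
docker build \
  --build-arg BUILD_TYPE=cublas \
  --build-arg CUDA_MAJOR_VERSION=12 \
  -t localai:custom-cuda12 .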