planning: Cortex Hardware API #1165

Closed
4 of 11 tasks
Tracked by #3908
dan-menlo opened this issue Sep 8, 2024 · 9 comments
Assignees
Labels
category: hardware management Related to hardware & compute needs pm Needs product level decisions

Comments

@dan-menlo
Contributor

dan-menlo commented Sep 8, 2024

Goal

  • We should have a very clear Eng Spec for the Cortex Hardware API in Sprint 23

Key Functionality

  • Hardware Detection
    • Cortex can list all available hardware
  • Hardware Activation
    • Cortex has a clear CLI and API to select active hardware
    • Cortex can activate specific hardware (e.g. CPU-only, or specific GPU)
  • Hardware -> Engines
    • Engines initialize using activated Hardware
    • List of active hardware is passed down to the engine (e.g. llama.cpp or TensorRT-LLM)
    • How does this interact with ngl settings?
  • Hardware Usage Detection
    • Cortex can detect free RAM or VRAM
  • Hardware Fallback

Tasklist

  • Design API (e.g. GET /hardware)
  • Design CLI (e.g. cortex hardware list?)

Functionality

  • Hardware Detection
  • Hardware Activation
  • Hardware Usage Detection (e.g. RAM, VRAM)

Cortex & Jan Integration

Previous Issues

Appendix

UX Goal

Cortex.cpp's Hardware API should enable us to do this in Jan
[image]

@dan-menlo dan-menlo converted this from a draft issue Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Hardware Selection & Error Handling epic: Cortex Active Hardware Selection & Error Handling Sep 8, 2024
@dan-menlo dan-menlo added the needs pm Needs product level decisions label Sep 8, 2024
@dan-menlo
Contributor Author

dan-menlo commented Sep 8, 2024

@louis-jan I'm assigning this to you in Sprint 20, as this has a significant CLI and API design component.

  • Can discuss later in the week if you want/can take on the implementation of it
  • May be a good exercise for you to gain deep understanding of our C++ codebase

EDIT: adding @nguyenhoangthuan99 for implementation

@dan-menlo dan-menlo moved this to Scheduled in Jan & Cortex Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Active Hardware Selection & Error Handling epic: Cortex Hardware Selection API Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Hardware Selection API epic: Cortex Hardware Selection and Model Compatibility API Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Hardware Selection and Model Compatibility API epic: Cortex Hardware Selection API Sep 8, 2024
@freelerobot freelerobot added the category: hardware management Related to hardware & compute label Sep 9, 2024
@dan-menlo
Contributor Author

@louis-jan @nguyenhoangthuan99 I am going to move this to Sprint 21, as I think you guys should land the Model Folder and model.yaml first.

@dan-menlo dan-menlo changed the title epic: Cortex Hardware Selection API epic: Cortex Hardware API Sep 17, 2024
@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 17, 2024

The hardware detection serves two main purposes:

  • Installing the correct version of the engine.
  • Running a model that fits the available resources.

To achieve these goals and to make debugging easier, as well as to help users choose the appropriate model, the hardware API/CLI should provide the following information:

  • Operating System (OS)
  • Number of CPU threads
  • Amount of free RAM
  • Presence of AVX instructions (a set of CPU instructions that can accelerate certain computations)
  • GPU information, including:
    • GPU ID
    • GPU name
    • GPU architecture
    • GPU driver version
    • CUDA driver version (for NVIDIA GPUs)
    • Compute capability (for NVIDIA GPUs)
    • Free VRAM (video memory)

Example response body:

{
  "os": "windows",
  "arch": "amd64",
  "suitable_avx": "avx2",
  "free_memory": 8192,
  "gpu_info": [
    {
      "id": "0",
      "name": "NVIDIA GeForce RTX 3090",
      "arch": "ampere",
      "driver_version": "552.12",
      "cuda_driver_version": "12.4",
      "compute_cap": "8.6",
      "free_vram": 8192
    }
  ]
}

Note: getting the free VRAM information from C++ is challenging and requires further investigation (the current approach is to parse the output of the nvidia-smi command).
This information would allow the system to make informed decisions about which engine version to install and which models can run efficiently on the user's hardware. It also provides valuable data for debugging. cc @louis-jan for recommendations from the Jan app side for easier integration
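The nvidia-smi parsing approach mentioned above could look roughly like this — a minimal C++ sketch assuming the standard `--query-gpu=index,name,memory.free --format=csv,noheader,nounits` output format; the `GpuInfo` struct and function name are hypothetical, not Cortex's actual code:

```cpp
#include <sstream>
#include <string>

struct GpuInfo {
  int id;
  std::string name;
  int free_vram_mib;
};

// Parse one line of `nvidia-smi --query-gpu=index,name,memory.free
// --format=csv,noheader,nounits`, e.g. "0, NVIDIA GeForce RTX 3090, 8192".
GpuInfo ParseNvidiaSmiLine(const std::string& line) {
  std::stringstream ss(line);
  std::string id_str, name, free_str;
  std::getline(ss, id_str, ',');
  std::getline(ss, name, ',');
  std::getline(ss, free_str, ',');
  // Trim the leading space nvidia-smi emits after each comma.
  auto trim = [](std::string& s) { s.erase(0, s.find_first_not_of(' ')); };
  trim(name);
  trim(free_str);
  return GpuInfo{std::stoi(id_str), name, std::stoi(free_str)};
}
```

A Vulkan- or NVML-based query would avoid shelling out entirely, at the cost of extra dependencies.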

@louis-jan
Contributor

louis-jan commented Sep 17, 2024

From Jan, we just expect some information for selecting the corresponding engine version / settings, such as CPU instructions / GPUs.

But we need to gather comprehensive hardware information for debugging, including CPU, GPU, RAM, OS, and connected monitors (as issues like projector connections have been known to impact performance).

Structure

To make user support easier, the hardware information should be grouped for quick lookup; a mix of flattened and grouped structures can be visually overwhelming.

E.g. the support engineer has to scroll to the bottom of the file to see os

✅✅
{
  "arch":  "", 
  "free_memory": "", 
  "gpus": [
   {},
   {},
   {}
  ], 
  "os":""
}
{
  "device": {
    "arch":  "", 
    "free_memory": "", 
    "os": ""
  },
  "gpus": [
   {},
   {},
   {}
  ]
}
{
  "cpu": {
    "arch":  "x64", 
    "cores": "4", 
    "model": "Intel Core i9 12900K",
    "instructions": [ "AVX512", "FMA", "SSE" ]
  },
  "os": {
    "version":  "10.2", 
    "name": "Windows 10 Pro"
  },
  "power": {
     "battery_life": 80,
     "charging_status": "charged",
     "is_power_saving": false
  },
  "ram": {
    "total":  "16", 
    "available": "12", 
    "type": "DDR4" // better model name?
  },
  "storage": {
      "total": 512,
      "available": 256,
      "type": "SSD" // better model name?
  },
  "gpus": [
   {},
   {},
   {}
  ], 
  "monitors": [
  ]
  }

Consistent from system to system

Different devices should yield the same output format (e.g. for GPU driver info); we should not have a different response body structure per GPU family.

E.g.

"graphics": [
   {
      "id": "0",
      "name": "NVIDIA GeForce RTX 3090",
      "driver_version": "552.12",
      "cuda_driver_version": "12.4",
      "compute_cap": "8.6",
      "free_vram": 8192
   },
  {
      "id": "1",
      "name": "AMD Radeon RX 6800 XT",
      "driver_version": "5.0.2?",
      "cuda_driver_version": "?",
      "compute_cap": "?",
      "free_vram": 8192
   }
]
"graphics": [
   {
      "id": "0",
      "name": "NVIDIA GeForce RTX 3090",
      "version": "12.4",
      "additional_information": { "driver_version": "552.12", "compute_cap": "8.6" },
      "free_vram": 8192,
      "total_vram": 8192
   },
   {
      "id": "1",
      "name": "AMD Radeon RX 6800 XT",
      "version": "6.1",
      "free_vram": 8192,
      "total_vram": 8192,
      "additional_information": { "rocm_git_revision": "0d0a7a10c1a3" }
   }
]

Should we try to gather anything that could affect performance?

  • Connected monitors?
  • RAM model / bus?
  • Power Saving mode?

Request

It would be beneficial to have filter query support, allowing clients to poll only for the data they need, e.g. ?filters=gpu,cpu
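A `?filters=gpu,cpu` query value could be split with a small helper like this sketch (`ParseFilters` is a hypothetical name; an empty value is treated as "no filter", i.e. return everything):

```cpp
#include <set>
#include <sstream>
#include <string>

// Parse a comma-separated filter value such as "gpu,cpu" into a set of
// section names. An empty result means no filtering was requested.
std::set<std::string> ParseFilters(const std::string& value) {
  std::set<std::string> out;
  std::stringstream ss(value);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) out.insert(item);
  }
  return out;
}
```

The handler would then serialize only the top-level sections ("cpu", "gpus", ...) whose names appear in the set.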

@nguyenhoangthuan99 @dan-homebrew

@freelerobot freelerobot removed their assignment Oct 14, 2024
@dan-menlo dan-menlo modified the milestones: v1.0.3, v1.0.2 Oct 14, 2024
@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Oct 14, 2024

Cortex Hardware Management Specifications

1. Hardware Detection

Task 1.1: Implement hardware detection function

  • Create an API that returns a JSON object with the following structure:
    {
      "cpu": {
        "arch": "string",
        "cores": "string",
        "model": "string",
        "instructions": ["string"]
      },
      "os": {
        "version": "string",
        "name": "string"
      },
      "ram": {
        "total": "string",
        "available": "string",
        "type": "string"
      },
      "storage": {
        "total": number,
        "available": number,
        "type": "string"
      },
      "gpus": [
        {
          "model": "string",
          "vram": "string",
          "driver_version": "string"
        }
      ],
      "power": {
        "battery_life": number,
        "charging_status": "string",
        "is_power_saving": boolean
      },
      "monitors": [
        {
          "resolution": "string",
          "refresh_rate": number
        }
      ]
    }
  • Add a new command to the Cortex CLI: cortex hardware list
  • Add an API to list hardware: GET /hardware

Task 1.2: Implement platform-specific detection

  • Develop separate modules for Windows, Linux, and macOS to gather hardware information
  • Use appropriate system calls and libraries

2. Hardware Activation

Cortex is stateless, so it's necessary to persist activated hardware in the database so it can be reused.

Task 2.1: Design database schema

  • Create a Hardware table in cortex.db with the following schema:
    CREATE TABLE Hardware (
      id TEXT PRIMARY KEY,
      type TEXT NOT NULL,
      name TEXT NOT NULL,
      is_active BOOLEAN DEFAULT 0,
      properties TEXT -- properties of the hardware as a JSON-dumped string
    );

We can decide which hardware entries should be placed in the database.

Task 2.2: Implement CLI for hardware activation

  • Add a new command to the Cortex CLI: cortex hardware activate <type> <name>
  • Implement logic to update the is_active field in the database

Task 2.3: Create API for hardware activation

  • Implement a RESTful API endpoint: POST /api/v1/hardware/activate
  • Accept JSON payload with type and name fields
  • Update database and return activation status

3. Hardware -> Engines Integration

Task 3.1: Modify engine initialization

  • Update engine initialization code to accept a list of active hardware
  • Implement a function to query active hardware from the database
  • If no hardware is activated, use the default settings, as in the current logic.

Task 3.2: Implement hardware passing to engines

  • Query the database for the hardware the engines should use; we can reuse cortex ps for the running model
    [image]

Task 3.3: Handle ngl settings

  • Implement logic to automatically set ngl to 0 when running on CPU-only
  • Create a model compatibility API to determine appropriate ngl settings for different hardware configurations
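The ngl logic could be sketched as a simple heuristic: ngl = 0 on CPU-only, otherwise as many layers as fit in free VRAM. Everything here (the per-layer size estimate, the function name) is an illustrative assumption, not Cortex's actual model-compatibility logic:

```cpp
#include <algorithm>

// Hypothetical heuristic for ngl (num GPU layers): 0 when running CPU-only,
// otherwise as many layers as fit in free VRAM given an estimated per-layer
// size. Sizes are in MiB.
int EstimateNgl(bool gpu_active, long free_vram_mib, long layer_size_mib,
                int total_layers) {
  if (!gpu_active || layer_size_mib <= 0) return 0;  // CPU-only: ngl = 0
  long fit = free_vram_mib / layer_size_mib;
  return static_cast<int>(std::min<long>(fit, total_layers));
}
```

A real implementation would also reserve headroom for the KV cache and context buffers rather than spending all free VRAM on layers.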

4. Hardware Usage Detection

Task 4.1: Implement RAM and VRAM detection

  • Create functions to detect free RAM and VRAM
  • Use platform-specific methods (e.g., Windows API, /proc/meminfo on Linux, sysctl on macOS)
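For the Linux `/proc/meminfo` path, the parsing step might look like this sketch (the function name is hypothetical; on Linux the input would come from reading `/proc/meminfo`, while Windows and macOS need their own backends):

```cpp
#include <sstream>
#include <string>

// Parse the MemAvailable field (in kB) from /proc/meminfo-style text.
// Returns -1 if the field is missing.
long ParseMemAvailableKb(const std::string& meminfo) {
  std::istringstream in(meminfo);
  std::string line;
  while (std::getline(in, line)) {
    if (line.rfind("MemAvailable:", 0) == 0) {  // line starts with the key
      std::istringstream fields(line.substr(13));
      long kb = -1;
      fields >> kb;
      return kb;
    }
  }
  return -1;
}
```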

Task 4.2: Implement GPU-specific detection

  • Create a module for NVIDIA GPU detection using nvidia-smi
  • Implement a placeholder module for AMD GPU detection using rocm-smi
  • Research and implement Vulkan-based VRAM detection as a fallback method

5. Hardware Fallback

Task 5.1: Implement fallback logic

  • Create a HardwareFallbackManager class to handle fallback scenarios
  • Implement logic to fall back to CPU if GPU inference fails
  • Implement detailed error logging for hardware-related failures
  • Create clear, user-friendly error messages for terminal output

Task 5.2: Implement RAM check for CPU fallback

  • Add a check for available RAM before falling back to CPU
  • Implement appropriate error handling and user notification if insufficient RAM is available
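The RAM check before CPU fallback might be as simple as this sketch (sizes in MiB; the function name and error string are hypothetical placeholders for the real error-handling path):

```cpp
#include <string>

// Before falling back from GPU to CPU, verify the model fits in free
// system RAM; otherwise surface a clear, user-friendly error.
std::string DecideCpuFallback(long model_size_mib, long free_ram_mib) {
  if (free_ram_mib >= model_size_mib) return "fallback_to_cpu";
  return "error: insufficient RAM for CPU fallback";
}
```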

Additional Tasks

Task 6.1: Documentation

  • Create comprehensive documentation for the new hardware management features
  • Include examples and use cases in the documentation

Task 6.2: Testing

  • Develop unit tests for each new function and class
  • Create integration tests to ensure proper interaction between hardware detection, activation, and engine initialization

@dan-menlo
Contributor Author

@nguyenhoangthuan99 This is well drafted. I have a few clarifications

1. Hardware Detection

  • I agree with this direction
  • Out of curiosity, why are we detecting monitors? (how is this relevant?)

2. Hardware Activation

We should specify that Cortex by default activates all hardware compute units, as part of setup/installation.

Will we need to define Hardware as an individual compute unit (e.g. compute + RAM)?

  • CPUs (+ RAM)
  • GPUs (+ VRAM)

There are some open questions on my side:

  • Can the CPU + RAM ever be deactivated?

3. Hardware -> Engines Integration

We may need to specify the logic if there are multiple active GPUs

  • e.g. which GPU has priority (e.g. Nvidia preferred, biggest capacity)
  • We should pick a smart default, as it will greatly improve user experience

@dan-menlo
Contributor Author

dan-menlo commented Oct 15, 2024

@nguyenhoangthuan99 There are a couple of out-of-scope but related issues that I'd like to also put down ideas for:

Can cortex run and cortex model start take in a hardware param?

  • We should enable users to specify which GPU to use when running a model
  • We likely need to use the index from cortex hardware list
  • We need to think through this more clearly - I think my idea is incorrect
cortex run gorilla --hardware 1,2,3

Model Compatibility

@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Oct 15, 2024

1. Hardware Detection

We need to gather comprehensive hardware information for debugging, including CPU, GPU, RAM, OS, and connected monitors (as issues like projector connections have been known to impact performance).

As Louis mentioned earlier, when using GPUs to render high-resolution content, high-frequency monitors can also affect GPU performance. Monitors are an additional hardware component that could impact performance, so we should notify users about this.

2. Hardware Activation

In my opinion, we should only allow activation of GPUs (or NPUs in the future when Cortex supports them). CPUs and RAM are always used to save and process buffers, even if we offload everything to the GPU.

3. Hardware -> Engines Integration

This part should fall back to the Model Compatibility API. Choosing which hardware to run depends mostly on the model, not just the engines. The best scenario, I think, would be to have a threshold for available VRAM (e.g., 3GB, 4GB, etc., depending on which model to run). Among activated hardware, we should choose the ones that have enough available VRAM. If more than one GPU is suitable, prefer NVIDIA GPUs.

Can cortex run and cortex model start take in hardware params?

I believe we have to implement this and provide an API for it. This is an important feature that makes Cortex more dynamic, eliminating the need to manually set environment variables. This API should, like hardware activation, only allow running with GPU (and NPU). However, it will require users to set hardware of the same type in a list (e.g., only NVIDIA GPUs).

Model Compatibility

Model compatibility will depend on activated hardware. Here are some ideas:

If users don't specify which hardware to use, we can choose hardware by:

  1. Ensuring the model can offload at least 30% of NGLs (Num GPU Layers)
  2. Running the model on the most powerful hardware: NVIDIA > AMD > (NPU) > CPU. If criterion 1 cannot be satisfied, fall back to the next available option. CPU will be the last option.

If users specify hardware:
Calculate NGLs and context length to fit that hardware and return recommendations.
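The selection rules above could be sketched as follows: require that at least 30% of the model's layers fit in a GPU's free VRAM, prefer NVIDIA over AMD among suitable GPUs, tie-break by larger free VRAM, and otherwise fall back to CPU. The struct, names, and arithmetic here are illustrative assumptions, not the final design:

```cpp
#include <string>
#include <vector>

struct ActiveGpu {
  std::string vendor;   // "nvidia" or "amd"
  long free_vram_mib;
};

// Pick a device per the rules above. layer_size_mib is an estimated
// per-layer footprint; total_layers is the model's layer count.
std::string PickDevice(const std::vector<ActiveGpu>& gpus,
                       long layer_size_mib, int total_layers) {
  // ceil(30% of layers), in integer arithmetic, times per-layer size.
  long need_mib = ((total_layers * 3 + 9) / 10) * layer_size_mib;
  const ActiveGpu* best = nullptr;
  for (const auto& g : gpus) {
    if (g.free_vram_mib < need_mib) continue;  // can't offload 30% of layers
    if (!best) { best = &g; continue; }
    bool better_vendor = g.vendor == "nvidia" && best->vendor != "nvidia";
    bool same_vendor = g.vendor == best->vendor;
    if (better_vendor || (same_vendor && g.free_vram_mib > best->free_vram_mib))
      best = &g;
  }
  return best ? best->vendor : "cpu";  // last resort: CPU
}
```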

@dan-menlo
Contributor Author

Marking as complete. Implementation epic: #1568

@github-project-automation github-project-automation bot moved this from In Progress to Review + QA in Jan & Cortex Oct 31, 2024
@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Jan & Cortex Nov 1, 2024
@gabrielle-ong gabrielle-ong removed this from the v1.0.2 milestone Nov 5, 2024

6 participants