Code Assistant Setup with RTX 4060: vLLM + RAG + Tabby + Forgejo Integration

Ingo Rohlf a0a52a34d3 next iteration with Sonnet 4.5...		2026-02-16 22:31:20 +01:00

Code Assistant Setup with RTX 4060

vLLM + RAG + Tabby + Forgejo Integration

This setup lets you try out several approaches to code assistance:

vLLM: fast LLM inference on the GPU
RAG: code search across your repositories
Tabby: a GitHub Copilot alternative
Forgejo: a self-hosted Git server

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    GPU machine (RTX 4060)                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ vLLM Server  │  │    Ollama    │  │  Open WebUI  │      │
│  │   :8000      │  │   :11434     │  │    :3000     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                            │
                            │ HTTP
                            │
┌─────────────────────────────────────────────────────────────┐
│                    Service server (no GPU)                   │
│  ┌────────────┐ ┌──────────────┐ ┌────────────┐            │
│  │  Forgejo   │ │    Tabby     │ │  Continue  │            │
│  │   :3001    │ │    :8080     │ │   :8082    │            │
│  └────────────┘ └──────────────┘ └────────────┘            │
│                                                              │
│  ┌────────────┐ ┌──────────────┐ ┌────────────┐            │
│  │  Qdrant    │ │ Embeddings   │ │  Indexer   │            │
│  │   :6333    │ │    :8081     │ │            │            │
│  └────────────┘ └──────────────┘ └────────────┘            │
│                                                              │
│  ┌───────────────────────────────────────────┐              │
│  │            Nginx Reverse Proxy             │              │
│  │                  :80                       │              │
│  └───────────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                            │
                            │
                      ┌─────────────┐
                      │  VS Code +  │
                      │  Continue   │
                      └─────────────┘

Prerequisites

GPU machine

NVIDIA RTX 4060 class GPU with 16GB VRAM (in the 40 series, the RTX 4060 Ti is the variant sold with 16GB)
Docker + Docker Compose
NVIDIA Container Toolkit
Ubuntu/Debian recommended

Service server

Docker + Docker Compose
Network connection to the GPU machine
At least 8GB RAM, 50GB disk

Installation

1. GPU server setup

# Install the NVIDIA Container Toolkit
# (requires NVIDIA's apt repository to be configured first)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Clone the repository
git clone <this-repo>
cd code-assistant

# Create the .env file
cp .env.example .env
# Optional: set HF_TOKEN for gated models

# Start the services
chmod +x start-gpu-server.sh
./start-gpu-server.sh

Model options for 16GB VRAM:

deepseek-ai/deepseek-coder-6.7b-instruct (default, very good)
codellama/CodeLlama-7b-Instruct-hf (Meta)
Qwen/Qwen2.5-Coder-7B-Instruct (Alibaba)
bigcode/starcoder2-7b (BigCode)

To switch models, change the --model line in docker-compose-gpu.yml.

2. Service server setup

# Clone the repository
git clone <this-repo>
cd code-assistant

# Configure .env
cp .env.example .env
nano .env
# IMPORTANT: set GPU_SERVER_IP to the address of your GPU machine!

# Start the services
chmod +x start-services.sh
./start-services.sh

3. Set up Forgejo

# Open http://localhost:3001
# or http://localhost/git/

# Complete the installation:
# 1. Create an admin account
# 2. Settings > Applications > create an access token
# 3. Copy the token

# Add the token to .env
echo "FORGEJO_TOKEN=your_token_here" >> .env

# Recreate the indexer so it picks up the new variable
# (a plain "restart" keeps the old container environment)
docker-compose -f docker-compose-services.yml up -d --force-recreate code-indexer

4. Add repositories

Create or import repositories in Forgejo. The code indexer then automatically:

Clones all repositories
Splits the code into chunks
Generates embeddings
Indexes them in Qdrant

The initial indexing run starts after 10 seconds and then repeats every 60 minutes.
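
The chunking step in the list above could look roughly like the following. This is a minimal sketch and not the repository's actual indexer: the function name `chunk_lines`, the window of 40 lines, and the overlap of 10 lines are assumptions chosen for illustration.

```python
from typing import Iterator


def chunk_lines(text: str, size: int = 40, overlap: int = 10) -> Iterator[dict]:
    """Yield overlapping line windows of a source file.

    size and overlap are illustrative defaults, not values from the indexer.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    lines = text.splitlines()
    step = size - overlap
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        yield {
            "start_line": start + 1,          # 1-based, inclusive
            "end_line": start + len(window),  # 1-based, inclusive
            "text": "\n".join(window),
        }
        if start + size >= len(lines):
            break  # this window already reached the end of the file
```

Overlapping windows keep a function that straddles a chunk boundary visible in at least one chunk. Each chunk's `text` would then be sent to the embeddings service and stored in Qdrant together with its line range.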

Usage

Option 1: vLLM directly

API test:

curl http://GPU_SERVER_IP:8000/v1/models

curl http://GPU_SERVER_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string"}],
    "max_tokens": 500
  }'

In code:

from openai import OpenAI

client = OpenAI(
    base_url="http://GPU_SERVER_IP:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="deepseek-coder",
    messages=[
        {"role": "user", "content": "Explain this code: def fib(n): ..."}
    ]
)
print(response.choices[0].message.content)

Option 2: Tabby (GitHub Copilot alternative)

VS Code extension:

Install the "Tabby" extension
Settings:
- Endpoint: http://localhost:8080
- Token: leave empty

Features:

Inline Code-Completion
Function/Class Suggestions
Multi-Line Completions

Option 3: Continue.dev (recommended)

Installation:

Install the VS Code extension "Continue"
Copy the config:

mkdir -p ~/.continue
cp continue-config/config.json ~/.continue/config.json
# Replace GPU_SERVER_IP with the address of your GPU server!
sed -i 's/GPU_SERVER_IP/192.168.1.xxx/g' ~/.continue/config.json
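
If `sed -i` is not available or behaves differently on your platform, the same substitution can be done in Python. This is a sketch under assumptions: `set_gpu_server_ip` is a hypothetical helper, and the IP address in the usage comment is a made-up example.

```python
from pathlib import Path


def set_gpu_server_ip(config_path: Path, ip: str,
                      placeholder: str = "GPU_SERVER_IP") -> int:
    """Replace every placeholder occurrence in the file; return the count."""
    text = config_path.read_text(encoding="utf-8")
    count = text.count(placeholder)
    if count:
        config_path.write_text(text.replace(placeholder, ip), encoding="utf-8")
    return count


# Usage (192.168.1.50 is an example address, use your GPU server's IP):
#   set_gpu_server_ip(Path.home() / ".continue" / "config.json", "192.168.1.50")
```

A return value of 0 means the placeholder was not found, for example because the file was already edited.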

Features:

/explain - explain code
/test - generate tests
/review - code review
/edit - edit code
RAG integration (uses your codebase)

Shortcuts:

Cmd/Ctrl + L - open chat
Cmd/Ctrl + I - inline edit
Highlight + Cmd/Ctrl + Shift + L - add context

Option 4: RAG API (with codebase context)

curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "How do we handle authentication in our backend?"}
    ],
    "use_rag": true
  }'

The API automatically searches your codebase for relevant code examples.
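
The same request can be issued from Python with the standard library only. This is a sketch: `build_rag_request` and `ask` are hypothetical helper names, and the response shape (`choices[0].message.content`) is an assumption based on the OpenAI-style path, since the source does not show a response.

```python
import json
import urllib.request


def build_rag_request(question: str, base_url: str = "http://localhost:8082",
                      use_rag: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "use_rag": use_rag,  # non-standard field taken from the curl example
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, timeout: float = 60.0) -> str:
    """Send the request and return the answer text.

    Assumes an OpenAI-style response body; adjust if the server differs.
    """
    with urllib.request.urlopen(build_rag_request(question), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Setting `use_rag=False` in `build_rag_request` would send the same question without codebase context, which is handy for comparing answers.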

Monitoring & Debugging

View logs

GPU server:

docker-compose -f docker-compose-gpu.yml logs -f vllm-server
docker-compose -f docker-compose-gpu.yml logs -f ollama

Service server:

docker-compose -f docker-compose-services.yml logs -f code-indexer
docker-compose -f docker-compose-services.yml logs -f tabby
docker-compose -f docker-compose-services.yml logs -f continue-server

Qdrant UI

Open http://localhost:6333/dashboard

Browse indexed code snippets
Test vector search
Check collection statistics
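
To make "vector search" concrete, the following toy example ranks stored vectors by cosine similarity to a query vector. It is an illustration only: Qdrant uses an approximate index rather than this brute-force loop, and the distance metric is configured per collection, so cosine similarity is an assumption here.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, points, k=3):
    """Return the k (score, payload) pairs most similar to query, best first."""
    scored = [(cosine(query, vec), payload) for vec, payload in points]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


# Toy data: 2-dimensional "embeddings" with a file name as payload.
points = [([1.0, 0.0], "a.py"), ([0.0, 1.0], "b.py"), ([1.0, 1.0], "c.py")]
print(top_k([1.0, 0.0], points, k=2))
```

In the real setup the vectors come from the embeddings service and the payload would carry the code snippet and its location.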

Health Checks

# vLLM
curl http://GPU_SERVER_IP:8000/health

# Services
curl http://localhost/health
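
The two checks above can be combined into a small script that reports all endpoints at once. This is a sketch using only the standard library; `is_healthy` and `check_all` are hypothetical helper names, and treating any HTTP 2xx status as healthy is an assumption.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, malformed URLs.
        return False


def check_all(endpoints: dict[str, str]) -> dict[str, bool]:
    """Map each endpoint name to its health status."""
    return {name: is_healthy(url) for name, url in endpoints.items()}


# Usage, with the URLs from the curl commands (replace GPU_SERVER_IP first):
#   check_all({"vllm": "http://GPU_SERVER_IP:8000/health",
#              "services": "http://localhost/health"})
```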

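Performance tips

Optimizing GPU utilization

In docker-compose-gpu.yml:

# --gpu-memory-utilization 0.95: use more of the GPU memory
# --max-model-len 16384: longer context
# Keep comments outside the folded ">" scalar; inside it, "#" text is not a YAML comment.
command: >
  --model deepseek-ai/deepseek-coder-6.7b-instruct
  --gpu-memory-utilization 0.95
  --max-model-len 16384
  --tensor-parallel-size 1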
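
Before raising `--max-model-len`, it helps to estimate how much KV cache a single full-length sequence needs. The sketch below uses the standard formula (keys plus values, per layer, per KV head). The model shape is an assumption for a LLaMA-style 7B model (32 layers, 32 KV heads, head dimension 128); check the model's config.json for the real values.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens: int, **model) -> float:
    """KV cache size in GiB for a single sequence of the given length."""
    return tokens * kv_cache_bytes_per_token(**model) / 2**30


# Assumed shape of a LLaMA-style 7B model; not read from the actual checkpoint.
ASSUMED_7B = dict(layers=32, kv_heads=32, head_dim=128)

for context in (4096, 8192, 16384):
    print(f"{context:>6} tokens -> {kv_cache_gib(context, **ASSUMED_7B):.1f} GiB KV cache")
```

Under these assumptions the loop reports 2.0, 4.0, and 8.0 GiB. A 6.7B model in FP16 already needs about 13.4GB for its weights (6.7 billion parameters at 2 bytes each), so a 16384-token context may not fit on a 16GB card without quantization or a smaller model.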

Faster responses

KV cache: enabled by default, nothing to configure

Use quantization (requires a checkpoint quantized in the matching format):

--quantization awq  # or gptq, if available

Adjust the maximum number of sequences batched concurrently:
```
--max-num-seqs 8
```

Indexing more repositories

Edit code-indexer/indexer.py:

INDEX_INTERVAL = 1800  # every 30 minutes instead of 60
CODE_EXTENSIONS = {
    '.py', '.js', '.ts', '.go', '.rs',  # add more as needed
    '.java', '.cpp', '.c', '.h'
}
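
How an extension set like `CODE_EXTENSIONS` is typically applied can be sketched as follows. This is not the code from indexer.py: `iter_code_files` and `SKIP_DIRS` are hypothetical names, and skipping `.git` and `node_modules` is an assumption.

```python
from pathlib import Path
from typing import Iterator

CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".cpp", ".c", ".h"}
SKIP_DIRS = {".git", "node_modules"}  # assumption: directories not worth indexing


def iter_code_files(repo_root: Path) -> Iterator[Path]:
    """Yield files under repo_root whose suffix is in CODE_EXTENSIONS."""
    for path in sorted(repo_root.rglob("*")):
        # Skip anything located inside an excluded directory.
        if any(part in SKIP_DIRS for part in path.relative_to(repo_root).parts):
            continue
        if path.is_file() and path.suffix.lower() in CODE_EXTENSIONS:
            yield path
```

Adding an extension to the set is then enough to bring a new language into the index on the next run.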

Troubleshooting

vLLM does not start

Problem: out of memory

# Check GPU memory
nvidia-smi

# Reduce the context length
# In docker-compose-gpu.yml:
--max-model-len 4096  # instead of 8192

Problem: model download fails

# Download manually with the Hugging Face CLI (pip install -U "huggingface_hub[cli]").
# HF_HOME points the cache at ./models, which can then be mounted
# at /root/.cache/huggingface inside the container.
HF_HOME="$(pwd)/models" huggingface-cli download deepseek-ai/deepseek-coder-6.7b-instruct

Code indexer finds no repos

# Check the indexer logs for Forgejo token errors
docker-compose -f docker-compose-services.yml logs code-indexer

# Test API access
curl -H "Authorization: token YOUR_TOKEN" \
  http://localhost:3001/api/v1/user/repos
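
If the curl call returns only some of your repositories, pagination may be the reason: the endpoint returns results in pages. The sketch below walks all pages using the `page` and `limit` query parameters. `fetch_page` and `iter_repos` are hypothetical helpers, and whether the repository's indexer paginates this way is not known from the source.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator


def fetch_page(base_url: str, token: str, page: int, limit: int = 50) -> list:
    """Fetch one page of /api/v1/user/repos (same endpoint as the curl test)."""
    query = urllib.parse.urlencode({"page": page, "limit": limit})
    req = urllib.request.Request(
        f"{base_url}/api/v1/user/repos?{query}",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def iter_repos(get_page: Callable[[int], list]) -> Iterator[dict]:
    """Yield repositories page by page until an empty page comes back."""
    page = 1
    while True:
        repos = get_page(page)
        if not repos:
            return
        yield from repos
        page += 1


# Usage (YOUR_TOKEN is a placeholder):
#   for repo in iter_repos(lambda p: fetch_page("http://localhost:3001", "YOUR_TOKEN", p)):
#       print(repo["full_name"])
```

Passing the page fetcher as a function keeps `iter_repos` testable without a running Forgejo instance.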

Tabby does not connect to vLLM

Check tabby-config/config.toml
Replace GPU_SERVER_IP with the actual IP
Test vLLM from the service server:

curl http://GPU_SERVER_IP:8000/v1/models

Continue.dev shows "Connection Error"

Check ~/.continue/config.json
Make sure apiBase is correct
Test the API:

curl http://localhost:8082/v1/models

Extensions

Fine-tuning your own model

# Use your indexed code data
docker exec -it code-indexer python3 /app/extract_training_data.py

# Fine-tuning with LoRA
# See: ./fine-tuning/README.md (TODO)

Integrating additional MCP servers

Add to docker-compose-services.yml:

mcp-jira:
  image: your-jira-mcp-server
  environment:
    - JIRA_URL=...

Web UI for non-developers

Open WebUI is already running on the GPU server: http://GPU_SERVER_IP:3000

Costs & alternatives

Cloud vs. self-hosted (16GB VRAM)

Setup	Cost/month	Latency	Privacy
RTX 4060 (one-time ~€400)	~€10 electricity	<100ms	✓ Full
AWS g5.xlarge	~€500	~200ms	Shared
Anthropic Claude API	~€50-200	~500ms	Cloud

Alternative models

Smaller models (less VRAM):

deepseek-ai/deepseek-coder-1.3b-instruct (3GB)
bigcode/starcoderbase-1b (2GB)

Larger models (with quantization):

WizardLM/WizardCoder-15B-V1.0 (with AWQ/GPTQ)
codellama/CodeLlama-13b-Instruct-hf (with 4-bit)
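
The sizes above follow from a simple rule of thumb: parameters times bytes per parameter. The sketch below applies it to the listed models. Parameter counts are taken from the model names, and the estimate covers weights only; KV cache, activations, and quantization overhead come on top.

```python
# Bytes per parameter for common precisions (4-bit ignores scale/zero-point overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate weight size in GB (1e9 bytes); excludes KV cache and overhead."""
    return params_billion * BYTES_PER_PARAM[dtype]


for name, params, dtype in [
    ("deepseek-coder-1.3b", 1.3, "fp16"),
    ("deepseek-coder-6.7b", 6.7, "fp16"),
    ("CodeLlama-13b", 13.0, "fp16"),
    ("CodeLlama-13b", 13.0, "4bit"),
    ("WizardCoder-15B", 15.0, "4bit"),
]:
    print(f"{name:<22} {dtype:<5} ~{weight_gb(params, dtype):.1f} GB")
```

This is why the 13B and 15B models are listed only with quantization: at FP16 their weights alone would exceed 16GB, while at 4-bit they come to roughly 6.5GB and 7.5GB.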

FAQ

Q: Can I run several models at the same time? A: Yes. With 16GB VRAM, two 7B models in 4-bit fit. One 7B plus one 1.3B model in FP16 is very tight: at about 2 bytes per parameter the weights alone need roughly 16GB, leaving little or no room for the KV cache.

Q: How fast is inference? A: RTX 4060: ~20-30 tokens/sec for 7B models, ~40-50 for 1B.

Q: Does this also work with AMD GPUs? A: Yes, on GPUs supported by vLLM's ROCm build; use ROCm instead of CUDA. See the vLLM ROCm docs.

Q: Do I really need RAG? A: For small codebases (<10 repos), fine-tuning is often enough. RAG helps with large or frequently changing codebases.

Q: Can I use this in production? A: Yes, but add:

SSL/TLS (Let's Encrypt)
Authentication
Rate limiting
Backups
Monitoring (Prometheus/Grafana)