# Troubleshooting

This section provides guidance for diagnosing and resolving common issues when working with the Lucid platform.

## Quick Diagnostics

If you're having trouble with agents or auditors, start with these checks:

```
Agents (1):
NAME      STATUS   MODEL   GPU
my-agent  running  llama3  H100
```

```shell
lucid status my-agent
```

```
Agent:  my-agent
ID:     agent-abc123
Status: running
Model:  llama3
GPU:    H100
```

```shell
lucid logs my-agent
```

```
[2024-01-15 10:30:00] Agent started
[2024-01-15 10:30:01] Auditors initialized
```
## Common Issues

### Agent Creation Fails

Symptoms:

- `lucid apply` returns an error
- Agent stuck in "pending" state

Solutions:

- Check authentication.
- Verify quota availability:
  - Check that you have available GPU/compute quota in the selected region
  - Try a different region if resources are constrained
- Check that auditor images are notarized:

```
[+] Compliance probe successful!
[*] Verification complete. Auditor is compliant.
```
### Local Environment Issues

Symptoms:

- `lucid apply` fails
- Kind cluster not starting

Solutions:

- Check prerequisites.
- Check cluster status:

```
Cluster: lucid-local-k8s
Status: Running
```

- Teardown and recreate.
### Auditor Verification Fails

Symptoms:

- `lucid auditor verify` reports errors
- Missing labels or endpoints

Solutions:

- Ensure the required OCI labels are present:

```dockerfile
LABEL io.lucid.auditor="true"
LABEL io.lucid.schema_version="1.0"
LABEL io.lucid.phase="request"
LABEL io.lucid.interfaces="health,audit"
```

- Implement the required endpoints:
  - `GET /health` - Must return `200 OK` with `{"status": "ok"}`
  - `POST /audit` - Main audit endpoint

- Check that the container runs as non-root:

```dockerfile
USER 1001
```
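The two required endpoints can be sketched with Python's standard library. This is an illustrative stand-in, not the real auditor interface: the port, payload shapes, and the `"decision"` response field are assumptions here; only the `/health` contract (`200 OK` with `{"status": "ok"}`) comes from the requirements above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class AuditorHandler(BaseHTTPRequestHandler):
    def _send_json(self, code: int, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET /health must return 200 OK with {"status": "ok"}
        if self.path == "/health":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        # POST /audit is the main audit endpoint
        if self.path != "/audit":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # A real auditor would inspect `request` and emit claims here.
        self._send_json(200, {"decision": "allow"})


# To run: HTTPServer(("127.0.0.1", 8090), AuditorHandler).serve_forever()
```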
### Passport Shows "Not Attested"

Symptoms:

- AI Passport shows `hardware_attested: false`
- Missing TEE information

Solutions:

- Verify the agent is running on TEE hardware:

```
Hardware Attested: true
TEE Type: AMD SEV-SNP
```

- Contact support if you expected hardware attestation but it's not present.
### CLI Authentication Issues

Symptoms:

- `401 Unauthorized` errors
- Token expired messages

Solutions:

- Re-authenticate.
- Generate a new API key.
- Check the credentials file. Credentials are stored in `~/.lucid/config.yaml`.
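As a quick sanity check, you can confirm the credentials file actually contains a key. This sketch assumes a flat `api_key: <value>` line in `~/.lucid/config.yaml`; the real file layout may differ, and a proper YAML parser would be preferable in production:

```python
from pathlib import Path
from typing import Optional


def read_api_key(config_path: Path) -> Optional[str]:
    """Return the api_key value from a flat `key: value` config file, or None."""
    if not config_path.exists():
        return None
    for line in config_path.read_text().splitlines():
        line = line.strip()
        if line.startswith("api_key:"):
            # Strip surrounding quotes, if any
            return line.split(":", 1)[1].strip().strip('"')
    return None


# e.g. read_api_key(Path.home() / ".lucid" / "config.yaml")
```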
## Advanced Troubleshooting

### Policy Evaluation Errors

Symptoms:

- Requests blocked unexpectedly
- Policy evaluation returns errors
- Claims not matching expected values

Solutions:

- Check policy syntax:

```shell
# Validate policy file locally
lucid policy validate my-policy.yaml
```

- Enable policy debug logging:

```yaml
# In your auditor configuration
env:
  LUCID_LOG_LEVEL: debug
  LUCID_POLICY_DEBUG: "true"
```

- Inspect claim values in logs:

```shell
lucid logs my-agent | grep "policy_evaluation"
```

- Common policy issues:

| Error | Cause | Fix |
|---|---|---|
| `KeyError: 'claim_name'` | Claim not produced by auditor | Check auditor produces required claims |
| `TypeError` in condition | Type mismatch in rule | Ensure claim types match (string vs int) |
| `PolicyNotFound` | Policy not registered | Run `lucid policy push` to register |

- Test the policy locally:

```python
from lucid_sdk import PolicyEngine, load_policy

policy = load_policy("my-policy.yaml")
engine = PolicyEngine(policy)

# Test with sample claims
test_claims = {"location.country": {"value": "US"}}
result = engine.evaluate(test_claims)
print(f"Decision: {result.decision}")
```
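The `KeyError` and `TypeError` rows in the table above come from how claims are looked up and compared during evaluation. A toy illustration of that logic (not the real `PolicyEngine`; the claim shape mirrors the `{"value": ...}` structure used in the local test above):

```python
def evaluate_rule(claims: dict, claim_name: str, expected) -> bool:
    # Raises KeyError('claim_name') if the auditor never produced this claim
    value = claims[claim_name]["value"]
    # Raises TypeError if the claim's type doesn't match the rule's (string vs int)
    if type(value) is not type(expected):
        raise TypeError(
            f"{claim_name}: {type(value).__name__} vs {type(expected).__name__}"
        )
    return value == expected


claims = {"location.country": {"value": "US"}}
```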
### Multi-Auditor Chain Debugging

Symptoms:

- Requests failing at an unknown point in the chain
- Auditors timing out
- Chain not executing in expected order

Solutions:

- Check the chain configuration:

```shell
lucid status my-agent --verbose
```

Output shows the auditor chain order and ports:

```
Audit Chain:
  1. injection-auditor (port 8090) → 8093
  2. toxicity-auditor (port 8093) → 5000
  3. model (port 5000)
```

- Enable request tracing:

```yaml
# In environment YAML
services:
  observability:
    enabled: true
    tracing: true
```

- View per-auditor timing:

```shell
lucid logs my-agent --filter auditor
```

Look for timing information:

```
[injection-auditor] Request processed in 45ms
[toxicity-auditor] Request processed in 120ms
```

- Debug an individual auditor:

```shell
# Test auditor directly (bypass chain)
curl -X POST http://localhost:8090/audit \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "test"}]}'
```

- Common chain issues:

| Symptom | Cause | Fix |
|---|---|---|
| First auditor works, second fails | Port misconfiguration | Check `LUCID_CHAIN_NEXT` env var |
| Timeout after specific auditor | Auditor crash/hang | Check auditor logs, increase timeout |
| Wrong auditor order | YAML ordering | Auditors execute in chain list order |
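The chain behaves like a sequence of hops: each auditor either blocks the request or forwards it to the next address (the `LUCID_CHAIN_NEXT` value), ending at the model. A toy Python sketch of that control flow, using the auditor names from the example above; the lambdas are stand-ins for real HTTP calls, and the `"allow"`/`"deny"` decisions are illustrative:

```python
def run_chain(auditors, request: str) -> str:
    """Run auditors in chain list order; stop at the first one that blocks."""
    for name, audit in auditors:
        if audit(request) != "allow":
            return f"blocked by {name}"
    return "delivered to model"


chain = [
    ("injection-auditor", lambda r: "allow"),
    ("toxicity-auditor", lambda r: "deny" if "toxic" in r else "allow"),
]
```

Because evaluation short-circuits at the first blocking auditor, a failure "at an unknown point" is usually pinned down by checking which auditor's log is the last to show the request.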
### Evidence Submission Failures

Symptoms:

- Evidence not appearing in the Verifier
- `EvidenceSubmissionError` in logs
- AI Passport missing claims

Solutions:

- Check Verifier connectivity:

```shell
# From the auditor pod
curl https://verifier.lucid.sh/health
```

- Verify the evidence schema:

```python
from lucid_schemas import Evidence

# Validate evidence before submission
evidence = Evidence(
    auditor_id="my-auditor",
    claims=[...],
    # ...
)
evidence.model_dump()  # Raises ValidationError if invalid
```

- Check authentication:

```shell
# Verify the API key is set
echo $LUCID_API_KEY
```

- Common evidence errors:

| Error | Cause | Fix |
|---|---|---|
| `ValidationError: claims` | Invalid claim format | Use the `lucid_schemas.Claim` type |
| `401 Unauthorized` | Missing/invalid API key | Set the `LUCID_API_KEY` env var |
| `413 Payload Too Large` | Evidence too large | Reduce claim data size |
| `409 Conflict` | Duplicate evidence ID | Ensure a unique `evidence_id` |

- Enable submission logging:

```yaml
env:
  LUCID_EVIDENCE_DEBUG: "true"
```
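For the `409 Conflict` row above, each submission needs a unique `evidence_id`. Assuming the schema accepts any unique string (the ID format here is an illustration, not a documented requirement), a random UUID avoids collisions:

```python
import uuid


def new_evidence_id(auditor_id: str) -> str:
    # uuid4 is random (122 bits of entropy), so repeated submissions
    # from the same auditor won't produce duplicate IDs
    return f"{auditor_id}-{uuid.uuid4()}"
```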
### Attestation Verification Failures

Symptoms:

- `lucid verify` fails
- Attestation report invalid
- Hardware quote verification error

Solutions:

- Check attestation status:

```shell
lucid verify environment env-abc123 --verbose
```

- Verify hardware support:

```shell
# Check if running on TEE-capable hardware
lucid status my-agent --hardware
```

Expected output for a valid TEE:

```
Hardware: AMD EPYC (SEV-SNP enabled)
TEE Status: Active
Quote Valid: true
```

- Common attestation errors:

| Error | Cause | Fix |
|---|---|---|
| `QuoteVerificationFailed` | Invalid hardware quote | Ensure TEE hardware is genuine |
| `MeasurementMismatch` | Code tampered | Redeploy from a verified image |
| `CertificateExpired` | Old attestation cert | Request fresh attestation |
| `NetworkError` | Can't reach Intel/AMD | Check firewall allows attestation endpoints |

- Mock mode issues: if using `TEE_PROVIDER=MOCK` (local dev environment), attestation will show `hardware_attested: false`. This is expected for local development.

- Refresh attestation:

```shell
# Force a new attestation report
lucid verify environment env-abc123 --refresh
```
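Conceptually, `MeasurementMismatch` means the hash of what is actually running no longer equals the measurement recorded in the attestation report, which is why redeploying from a verified image fixes it. A simplified illustration (real TEE measurements are produced by the hardware over the launched image, not computed like this):

```python
import hashlib


def measure(code: bytes) -> str:
    # A measurement is a cryptographic hash of the launched code/image
    return hashlib.sha384(code).hexdigest()


def verify_measurement(report_measurement: str, running_code: bytes) -> bool:
    # MeasurementMismatch corresponds to this check returning False
    return report_measurement == measure(running_code)
```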
### Operator Webhook Issues

Symptoms:

- Pods not getting auditor sidecars
- Webhook timeout errors
- Certificate errors

Solutions:

- Check webhook registration:

```shell
kubectl get mutatingwebhookconfigurations lucid-operator -o yaml
```

- Verify the namespace label:

```shell
# The namespace must have this label for injection
kubectl get namespace my-namespace --show-labels | grep lucid
```

Required label:

```yaml
labels:
  lucid.computing/enabled: "true"
```

- Check operator logs:

```shell
kubectl logs -n lucid-system -l app=lucid-operator | grep webhook
```

- Certificate issues:

```shell
# Check certificate validity
kubectl get secret lucid-operator-tls -n lucid-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
```

See Operator Webhook for detailed webhook documentation.
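The `openssl x509 -noout -dates` command above prints `notBefore=`/`notAfter=` lines. If you script this check, the `notAfter=` line can be turned into days remaining (assuming OpenSSL's default GMT date format; this helper is illustrative, not part of the Lucid tooling):

```python
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after_line: str, now: Optional[datetime] = None) -> int:
    """Parse a line like 'notAfter=Jun  1 12:00:00 2026 GMT' into days left."""
    stamp = not_after_line.split("=", 1)[1].replace(" GMT", "").strip()
    expires = datetime.strptime(stamp, "%b %d %H:%M:%S %Y").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days  # negative means already expired
```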
## Health Check Endpoints

All Lucid services expose health endpoints that you can use for debugging:

| Service | Health Endpoint |
|---|---|
| Verifier API | `https://verifier.lucid.sh/health` |
| Observer UI | `https://observer.lucid.sh/api/health` |
| Auditors | `/health` on the configured port |
## Getting Help

If you can't resolve an issue:

- Check the GitHub Issues
- Search existing discussions
- Contact support at support@lucid.sh with:
  - Error messages (redact sensitive data)
  - Steps to reproduce
  - Agent ID and passport IDs if applicable