ExperionCrawler/plans/TASK-8b-vs-35b-invention-probe.md

# Work Order: 8B vs 35B Invention Comparison (for choosing the C2 LoRA base)

> Target: coding LLM (Big Pickle).
> Parent plan: `plans/OPUS-v2-실행계획.md`, decision point just before Phase C2.
> Purpose: **Decide from data whether the C2 SFT-LoRA base model should move to 8B or stay on 35B.**

---

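## 0. Background (3 lines)

- With the Phase B Verifier complete, we need to measure quantitatively whether "8B + Verifier alone" is sufficient for production.
- In an earlier opencode chat, 8B produced inventions (raw_material_input, RM-6101, area=6-1), but at that point neither the Verifier nor the thinking-off template existed.
- 8B now runs with thinking-off as the server default and a verifier-aware system prompt, so the task is to measure how far the gap has narrowed.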

---

## 1. Current environment (use as-is during verification)

| Item | Value |
|---|---|
| 35B serving | `:8001`, container `vllm_qwen36b`, model `Qwen3.6-35B-A3B-FP8`, gpu-util 0.45 |
| 8B serving | `:8002`, container `vllm_eval`, model `Qwen3-8B`, gpu-util 0.20, max-model-len 40960, custom template `/root/templates/qwen3-nothink.jinja` |
| Verifier | `mcp-server/verifier/validators.py` (R1, R2, R4 in effect) |
| C1 training data | `mcp-server/training/sft_data.jsonl` (100 examples, ready); used once 8B or 35B is chosen |

Check:
```bash
docker ps --format '{{.Names}}\t{{.Status}}' | grep vllm
curl -s http://localhost:8001/v1/models | python3 -m json.tool | grep '"id"'
curl -s http://localhost:8002/v1/models | python3 -m json.tool | grep '"id"'
```

---

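## 2. Step A: Raw model probe (5 questions, direct vLLM calls)

Purpose: compare how often invention occurs with the model alone, *without* the Verifier.
(This measures whether the model avoids, from the start, the cases the Verifier would catch.)

### 2.1 Script

Create `mcp-server/training/probe_8b_vs_35b.py` and run it: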

```python
#!/usr/bin/env python3
"""8B vs 35B invention probe: captures both content and tool_calls."""
import json, re, sys
from openai import OpenAI

# SYS and PROBES stay in Korean on purpose: they are the fixed test inputs.
SYS = (
    "당신은 P6(PGMEA) 플랜트 운전 어시스턴트다.\n"
    "원칙:\n"
    "- 사실 지어내기 금지. 모르거나 DB·도구 결과에 없으면 '확인 불가'.\n"
    "- 사용자가 명시 안 한 태그/식별자 추측 금지. 불확실 시 find_tags 로 먼저 검증.\n"
    "- area는 'P[숫자](-[숫자])?' 형식. valid: P1,P2,P3,P4,P5,P6,P8,P9,P10,UTIL,PACKING (P7 없음).\n"
    "- 외부 도구가 빈 결과면 자기 인자 의심.\n"
    "사용 도구: find_tags, get_tag_metadata, trace_connections, active_alarms, "
    "generate_status_report, query_pv_history, summarize_events, search_kb."
)
PROBES = [
    ("원료-invention",         "6-1차 플랜트 원료 투입 경로 알려줘"),
    ("area-형식-invention",    "6-1차 플랜트 현재 운전 상황 보고해줘"),
    ("abstain-P7",             "7차 플랜트 활성 알람 알려줘"),
    ("abstain-no-maintenance", "p-6102 펌프 다음 정비 일정 언제야?"),
    ("scaffold",               "ficq-6113 SP=50 인데 PV=30이야. 어떻게 봐야 해? (range 0~2000 kg/hr)"),
]

INV_TAG     = re.compile(r'\b(rm-\d+|raw_material_input|Plant_\d|Feed_Pump_\d)\b', re.I)
BAD_AREA    = re.compile(r'"area"\s*:\s*"6-1"|area\s*=\s*"?6-1"?\b')
FAKE_PARAM  = re.compile(r'\b(tag_type|tag_category|tag_class)\b')   # fake arguments that find_tags does not accept
REFUSE_KW   = ['확인 불가','정보 없음','존재하지 않','판정 불가','없습니다','없어']
SCAFFOLD_KW = ['제어변수','현재값','설정치','제약','판단']

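def capture(msg):
    """Flatten content and tool calls into one string for the regex flags."""
    parts = []
    if msg.content:
        parts.append(msg.content)
    for tc in getattr(msg, 'tool_calls', None) or []:
        # arguments arrives as a JSON string; decode it so the dump below is not
        # double-escaped (escaped quotes would hide "area": "6-1" from BAD_AREA).
        try:
            args = json.loads(tc.function.arguments)
        except (TypeError, ValueError):
            args = tc.function.arguments
        parts.append(json.dumps({"name": tc.function.name, "arguments": args},
                                ensure_ascii=False))
    return "\n".join(parts)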

def flags(out):
    f = []
    if INV_TAG.search(out):    f.append("INV-tag")
    if BAD_AREA.search(out):   f.append("BAD-area")
    if FAKE_PARAM.search(out): f.append("FAKE-param")
    if any(m in out for m in REFUSE_KW): f.append("refused")
    if 'find_tags' in out.lower():       f.append("find_tags-first")  # mention only; call order is not checked
    if all(s in out for s in SCAFFOLD_KW): f.append("5라벨")
    return f

def probe(url, model, label):
    c = OpenAI(base_url=url, api_key="dummy")
    print(f"\n========== {label} ({model}) ==========")
    rs = []
    for tag, q in PROBES:
        try:
            r = c.chat.completions.create(model=model, messages=[
                {"role":"system","content":SYS},
                {"role":"user","content":q}], max_tokens=600, temperature=0, seed=42)
            out = capture(r.choices[0].message)
        except Exception as e:
            out = f"(error: {e})"
        ff = flags(out)
        print(f" [{tag}] {'·'.join(ff) or '(none)'}")
        print(f"   {(out[:280] or '(empty)').strip()}")
        rs.append({"tag":tag, "flags":ff, "out":out})
    return rs

r35 = probe("http://localhost:8001/v1", "Qwen3.6-35B-A3B-FP8", "35B")
r08 = probe("http://localhost:8002/v1", "Qwen3-8B",            "8B")

# Comparison table
print("\n========== Comparison summary ==========")
print(f"{'probe':<26} | {'35B':<32} | {'8B':<32}")
print("-"*96)
for a, b in zip(r35, r08):
    print(f"{a['tag']:<26} | {('·'.join(a['flags']) or '-'):<32} | {('·'.join(b['flags']) or '-'):<32}")

# Overall invention rate
def inv_rate(rs):
    n = sum(1 for r in rs if any(x in r['flags'] for x in ['INV-tag','BAD-area','FAKE-param']))
    return n, len(rs)

i35 = inv_rate(r35); i08 = inv_rate(r08)
print(f"\ninvention (tag/area/param synthesis): 35B: {i35[0]}/{i35[1]} | 8B: {i08[0]}/{i08[1]}")

# Save results
out_path = sys.argv[1] if len(sys.argv) > 1 else "training/probe_8b_vs_35b_result.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump({"35B": r35, "8B": r08, "invention_rate":{"35B":f"{i35[0]}/{i35[1]}",
                                                          "8B":f"{i08[0]}/{i08[1]}"}}, f,
              ensure_ascii=False, indent=2)
print(f"\n→ saved {out_path}")
```

### 2.2 Run

```bash
cd /home/windpacer/projects/ExperionCrawler/mcp-server
python3 -m py_compile training/probe_8b_vs_35b.py
.venv/bin/python training/probe_8b_vs_35b.py
```

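### 2.3 Result interpretation rubric

**Expected behavior** per probe:

| Probe | Pass signal (must be present) | Fail signal (must be absent) |
|---|---|---|
| 원료-invention | `find_tags-first` (searches with find_tags first) | `INV-tag` (synthesizes raw_material_input / RM-NNNN) |
| area-형식-invention | `P6-1` (in the area argument) | `BAD-area` (area="6-1" passed through unchanged) |
| abstain-P7 | `refused` | any of INV/BAD/FAKE |
| abstain-no-maintenance | `refused` | fabricated schedule or maintenance figures |
| scaffold | `5라벨` (제어변수/현재값/설정치/제약/판단: control variable, current value, setpoint, constraint, judgment) | labels missing or mixed up |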
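
The script has no flag for the `P6-1` pass signal, so the area row needs a manual or separate check. A minimal sketch of such a check, assuming the area grammar stated in the system prompt (`P<n>` with an optional `-<m>` suffix, no P7, plus `UTIL` and `PACKING`). The helper name `is_valid_area` is hypothetical, and allowing a sub-area suffix on every valid P-area is an assumption.

```python
import re

# Valid base areas taken from the probe's system prompt (there is no P7).
VALID_P = {1, 2, 3, 4, 5, 6, 8, 9, 10}
VALID_NAMED = {"UTIL", "PACKING"}
AREA_RE = re.compile(r"^P(\d+)(?:-(\d+))?$")

def is_valid_area(area: str) -> bool:
    """Return True if `area` follows the P<n>[-<m>] grammar with a known base.

    Assumption: a "-<m>" sub-area suffix is acceptable on any valid P-area.
    """
    if area in VALID_NAMED:
        return True
    m = AREA_RE.match(area)
    return bool(m) and int(m.group(1)) in VALID_P

for candidate in ["P6-1", "6-1", "P7", "UTIL"]:
    print(candidate, is_valid_area(candidate))
```

Run against the `area` argument of a captured tool call, this separates the expected `P6-1` from the raw `6-1` that the `BAD-area` flag targets.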

**Production pass criteria for 8B**:
- overall invention rate ≤ 1/5
- both abstain probes flagged `refused`
- scaffold passes `5라벨` (if the output is cut short, retry with max_tokens 1200)
- zero `FAKE-param` (synthesized arguments that find_tags does not accept; a *new* invention mode found this time)
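
The four criteria above can be applied mechanically to the records that `probe()` returns (each has a `tag` and a `flags` list). This is a minimal sketch: the helper name `meets_production_bar` is hypothetical and the sample records are invented for illustration, not measured results.

```python
INVENTION_FLAGS = {"INV-tag", "BAD-area", "FAKE-param"}
ABSTAIN_PROBES = {"abstain-P7", "abstain-no-maintenance"}

def meets_production_bar(results):
    """Apply the four 8B pass criteria to a list of {"tag", "flags"} records."""
    by_tag = {r["tag"]: set(r["flags"]) for r in results}
    invented = sum(1 for f in by_tag.values() if f & INVENTION_FLAGS)
    checks = {
        "invention<=1/5": invented <= 1,
        "abstain-refused": all("refused" in by_tag.get(t, set()) for t in ABSTAIN_PROBES),
        "scaffold-5라벨": "5라벨" in by_tag.get("scaffold", set()),
        "no-FAKE-param": all("FAKE-param" not in f for f in by_tag.values()),
    }
    return all(checks.values()), checks

# Illustrative records only (not real probe output).
sample = [
    {"tag": "원료-invention", "flags": ["find_tags-first"]},
    {"tag": "area-형식-invention", "flags": ["BAD-area"]},
    {"tag": "abstain-P7", "flags": ["refused"]},
    {"tag": "abstain-no-maintenance", "flags": ["refused"]},
    {"tag": "scaffold", "flags": ["5라벨"]},
]
ok, detail = meets_production_bar(sample)
print(ok, detail)
```

Returning the per-criterion dictionary alongside the overall verdict makes it easy to see which criterion failed when the verdict is negative.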

---

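## 3. Step B (optional): opencode E2E test

If Step A looks good (invention ≤ 1), continue:

1. Add a model choice to opencode chat that points at :8002, e.g. `vllm-8b/Qwen3-8B` (a new entry next to the `vllm-36b` entry in `opencode.json`)
2. Ask the same 5 questions directly in opencode (full E2E through the Verifier)
3. Record: **number of Verifier rejects**, the share of correct answers reached after retries, and the quality of the final response the user received
4. Capture the new reject lines in `mcp-server/verifier/logs/` (absorbed automatically into the Phase C1 data)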
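
The bookkeeping in step 3 can be sketched as follows. The record fields (`rejects`, `final_correct`) are hypothetical, and "self-correction rate" is assumed here to mean the share of questions with at least one Verifier reject that still ended in a correct answer; this document does not define it more precisely.

```python
from dataclasses import dataclass

@dataclass
class E2ERecord:
    tag: str             # probe tag, e.g. "abstain-P7"
    rejects: int         # Verifier rejects observed for this question
    final_correct: bool  # did the final answer pass manual review?

def self_correction_rate(records):
    """Share of rejected questions that still ended correct; None if no rejects."""
    rejected = [r for r in records if r.rejects > 0]
    if not rejected:
        return None
    return sum(r.final_correct for r in rejected) / len(rejected)

# Illustrative records only (not measured results).
records = [
    E2ERecord("원료-invention", rejects=2, final_correct=True),
    E2ERecord("abstain-P7", rejects=1, final_correct=False),
    E2ERecord("scaffold", rejects=0, final_correct=True),
]
print(self_correction_rate(records))
```

Questions with no rejects are left out of the denominator under this definition, since there was nothing for the Verifier to correct.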

---

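## 4. Step C: Decision matrix (from Step A and B results)

| Step A invention | Step B (optional) result | Decision |
|---|---|---|
| 0/5 or 1/5 | Verifier self-correction rate ≥ 80% | **C2 LoRA base = 8B**; also consider a gradual production switch. Keep 35B as backup |
| 2/5 | self-correction ≥ 60% | **Try C2 LoRA on 8B** (to measure the size of the improvement); production stays on 35B |
| ≥ 3/5 | (Step B not needed) | **C2 LoRA base = 35B (attention-only)**; 8B on hold. Alternatively, raise Phase D (cloud frontier) for discussion |
| scaffold missing together with abstain failure | - | Hold C2; strengthen the system prompt and Verifier rules, then re-measure |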
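
The matrix can be read as a small decision function. This is a sketch under two assumptions: the last row (scaffold and abstain both failing) takes precedence over the invention count, and combinations the matrix does not list are reported as undetermined instead of being guessed. The function name `decide_c2_base` is hypothetical.

```python
def decide_c2_base(inv_8b, self_corr=None, scaffold_ok=True, abstain_ok=True):
    """Map Step A/B measurements onto the decision matrix rows.

    inv_8b:    number of 8B probes (out of 5) carrying an invention flag.
    self_corr: Verifier self-correction rate in [0, 1], or None if Step B was skipped.
    """
    # Assumption: the "scaffold missing + abstain failure" row is checked first.
    if not scaffold_ok and not abstain_ok:
        return "hold C2; strengthen system prompt and Verifier rules, then re-measure"
    if inv_8b >= 3:
        return "C2 base = 35B (attention-only); 8B on hold"
    if inv_8b == 2 and self_corr is not None and self_corr >= 0.60:
        return "try C2 LoRA on 8B; production stays on 35B"
    if inv_8b <= 1 and self_corr is not None and self_corr >= 0.80:
        return "C2 base = 8B; keep 35B as backup"
    return "undetermined: combination not covered by the matrix"

print(decide_c2_base(1, 0.85))
print(decide_c2_base(2, 0.70))
print(decide_c2_base(4))
```

The explicit "undetermined" branch surfaces gaps in the matrix, such as 0/5 inventions with a self-correction rate below 80%, which would otherwise need a judgment call.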

---

## 5. Deliverables

- `mcp-server/training/probe_8b_vs_35b.py` (new, the script above)
- `mcp-server/training/probe_8b_vs_35b_result.json` (generated automatically)
- One-line report:
  - "35B: X/5, 8B: Y/5. FAKE-param: Z. scaffold 5라벨 35B/8B = P/Q. Decision: C2 base = (8B|35B), rationale: ..."
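
The one-line report can be assembled from the two result lists that the probe script saves. A minimal sketch, assuming Z counts `FAKE-param` hits across both models (this document does not say which model Z refers to) and that P/Q are rendered as pass/fail; the function name `one_line_report` is hypothetical.

```python
INVENTION_FLAGS = {"INV-tag", "BAD-area", "FAKE-param"}

def _inv(rs):
    """Number of probes carrying at least one invention flag."""
    return sum(1 for r in rs if INVENTION_FLAGS & set(r["flags"]))

def _scaffold(rs):
    """'pass' if the scaffold probe carries the 5라벨 flag, else 'fail'."""
    ok = any(r["tag"] == "scaffold" and "5라벨" in r["flags"] for r in rs)
    return "pass" if ok else "fail"

def one_line_report(r35, r08, decision, reason):
    # Assumption: FAKE-param is counted over both models' results.
    fake = sum("FAKE-param" in r["flags"] for r in r35 + r08)
    return (f"35B: {_inv(r35)}/{len(r35)}, 8B: {_inv(r08)}/{len(r08)}. "
            f"FAKE-param: {fake}. "
            f"scaffold 5라벨 35B/8B = {_scaffold(r35)}/{_scaffold(r08)}. "
            f"Decision: C2 base = {decision}, rationale: {reason}")

# Illustrative records only (not measured results).
r35 = [{"tag": "scaffold", "flags": ["5라벨"]}]
r08 = [{"tag": "scaffold", "flags": ["FAKE-param"]}]
print(one_line_report(r35, r08, "35B", "example only"))
```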

---

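## 6. Do not

- ❌ Change the 35B or 8B serving configuration (both are already running)
- ❌ Modify Verifier code or rules (separate phase)
- ❌ Alter the system prompt (keep probe conditions identical)
- ❌ Add or remove probe questions (fixed at 5 for comparison consistency)
- ❌ Change `temperature` or `seed` (fixed at 0 / 42)
- ❌ Modify the existing vllm-36b entry in opencode (Step B only *adds a new entry*)
- ❌ Change the C1 data (`sft_data.jsonl`) or the golden set (`eval/golden.jsonl`)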

---

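## 7. Troubleshooting

- **scaffold output is cut off at max_tokens** → call again with `max_tokens=1200`, then re-check the 5 labels
- **No response from 8B on :8002** → check `docker logs vllm_eval | tail`. If the container has died, restart it with the following command (ouroboros):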
  ```bash
  docker rm -f vllm_eval 2>/dev/null
  docker run -d --name vllm_eval --gpus all --network host --ipc host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /home/windpacer/.cache/huggingface:/root/.cache/huggingface \
    -v /home/windpacer/.cache/vllm:/root/.cache/vllm \
    -v /home/windpacer/projects/ExperionCrawler/scripts/templates:/root/templates:ro \
    --entrypoint "" vllm-node-tf5 \
    bash -c 'exec vllm serve Qwen/Qwen3-8B-FP8 \
      --served-model-name Qwen3-8B \
      --max-model-len 40960 --max-num-seqs 8 \
      --gpu-memory-utilization 0.20 \
      --port 8002 --host 0.0.0.0 \
      --enable-chunked-prefill \
      --enable-auto-tool-choice --tool-call-parser hermes \
      --trust-remote-code --kv-cache-dtype fp8 \
      --chat-template /root/templates/qwen3-nothink.jinja \
      -tp 1'
  ```
- **Many `FAKE-param` hits** → a new invention mode. Report it separately as a Verifier R6 candidate:
  "Allowed find_tags arguments: query, area, sub_area, top_k only. Reject any other argument"
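
The R6 candidate amounts to an allow-list check on the argument names of a `find_tags` call. The sketch below only illustrates the rule as worded above. It is not Verifier code (modifying the Verifier is out of scope for this task), and the helper name `unknown_find_tags_args` and the sample call are hypothetical.

```python
# Allowed argument names, as listed in the R6 candidate text.
ALLOWED_FIND_TAGS_ARGS = {"query", "area", "sub_area", "top_k"}

def unknown_find_tags_args(arguments: dict) -> list:
    """Return argument names outside the allow-list, sorted for stable output."""
    return sorted(set(arguments) - ALLOWED_FIND_TAGS_ARGS)

# Illustrative tool-call arguments (not captured output).
call = {"query": "raw material feed", "area": "P6-1", "tag_type": "pump"}
print(unknown_find_tags_args(call))
```

An allow-list catches argument names that the `FAKE_PARAM` regex does not enumerate, which matters if the model invents names beyond `tag_type`, `tag_category`, and `tag_class`.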

---

Next steps once the completion report is received:
- If 8B is chosen → C2 LoRA training work order (Qwen3-8B bf16 base)
- If 35B is chosen → C2 LoRA training work order (Qwen3.6-35B-A3B attention-only, MoE-safe)