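Quantization

Overview

Quantization is an optimization technique that converts a neural network's weights and activations from high-precision floating point (FP32) to low-bit-width integer types such as INT8 or INT4, reducing model size, speeding up inference, and lowering power consumption.

It is one of the most important techniques for embedded and edge AI, and brings the following benefits.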

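Effect	FP32 → INT8 conversion
Model size reduction	About 1/4 (4x smaller)
Inference speedup	Typically 2-4x (CPU), up to 8x (NPU)
Memory bandwidth reduction	About 1/4 (better cache efficiency)
Power reduction	Several-fold to tens-of-fold, depending on hardware
Accuracy loss	Typically 0.1-2% (task-dependent)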

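There are two mainstream approaches: post-training quantization (PTQ), which is applied to an already trained model without retraining, and quantization-aware training (QAT), which trains the model while accounting for quantization.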

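History and background

Research on neural network quantization became active around 2016-2017. Work on 1-bit networks such as BinaryNet and XNOR-Net came first, but the accuracy loss was large and practical adoption remained limited.

Main milestones in quantization research:

2016: BinaryNet proposed (weights binarized to ±1)
2017: Google integrates INT8 quantization into TensorFlow, opening the way to practical use
2018: The paper "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" by Jacob et al. establishes the theory of INT8 quantization
2019: Full INT8 quantization (including inputs and outputs) becomes available for TFLite Micro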
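2022: High-accuracy post-training quantization methods such as GPTQ and SmoothQuant appear
2022 onward: 4-bit quantization for LLMs (large language models) develops rapidly (GPTQ, followed by AWQ)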

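Technical specifications

Mathematical definition of quantization

The conversion from a real (floating-point) value to an integer is given by the following formulas.

Linear (affine) quantization: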

Quantize (Q): q = clamp(round(r / s) + z, q_min, q_max)
Dequantize (D): r ≈ s × (q - z)

Symbols:
  r = real value (FP32)
  q = quantized value (INT8)
  s = scale factor (a positive FP32 value)
  z = zero point (an integer)
  q_min = -128 (signed INT8)
  q_max = 127 (signed INT8)

Computing the scale and zero point (symmetric and asymmetric quantization):

import numpy as np

def compute_scale_zeropoint(data, n_bits=8, symmetric=True):
    """Compute the quantization parameters (scale, zero point) for data."""
    q_min = -(2 ** (n_bits - 1))      # INT8: -128
    q_max = (2 ** (n_bits - 1)) - 1   # INT8: 127

    r_min = float(data.min())
    r_max = float(data.max())

    if symmetric:
        # Symmetric quantization: the zero point is fixed at 0
        abs_max = max(abs(r_min), abs(r_max))
        scale = abs_max / q_max
        zero_point = 0
    else:
        # Asymmetric quantization: the zero point is generally nonzero
        scale = (r_max - r_min) / (q_max - q_min)
        zero_point = 0
        if scale > 0:
            zero_point = round(q_min - r_min / scale)
            zero_point = int(np.clip(zero_point, q_min, q_max))

    if scale == 0:
        scale = 1.0  # constant input: avoid dividing by zero in quantize()

    return scale, zero_point

def quantize(data, scale, zero_point, n_bits=8):
    """Quantize real values to integers (int8 storage, so n_bits <= 8)."""
    q_min = -(2 ** (n_bits - 1))
    q_max = (2 ** (n_bits - 1)) - 1
    q = np.round(data / scale) + zero_point
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q_data, scale, zero_point):
    """Dequantize integers back to real values."""
    return scale * (q_data.astype(np.float32) - zero_point)
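
The difference between the symmetric and asymmetric schemes is easiest to see on one-sided data. The following is a minimal, self-contained sketch, assuming hypothetical ReLU-like activations (all non-negative); the helper names affine_params, symmetric_params, and round_trip are illustrative and not part of any library.

```python
import numpy as np

def affine_params(r_min, r_max, q_min=-128, q_max=127):
    """Asymmetric (affine) scale and zero point for the range [r_min, r_max]."""
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)  # keep 0.0 exactly representable
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def symmetric_params(r_min, r_max, q_max=127):
    """Symmetric scale; the zero point is fixed at 0."""
    return max(abs(r_min), abs(r_max)) / q_max, 0

def round_trip(x, scale, zero_point, q_min=-128, q_max=127):
    """Quantize to the INT8 grid and immediately dequantize."""
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return scale * (q - zero_point)

rng = np.random.default_rng(0)
# Hypothetical ReLU-like activations: every value is non-negative.
x = np.maximum(rng.normal(size=10_000), 0.0)

sym_scale, sym_zp = symmetric_params(float(x.min()), float(x.max()))
asym_scale, asym_zp = affine_params(float(x.min()), float(x.max()))

sym_err = np.abs(round_trip(x, sym_scale, sym_zp) - x).max()
asym_err = np.abs(round_trip(x, asym_scale, asym_zp) - x).max()

print(f"symmetric : scale={sym_scale:.5f}, max error={sym_err:.5f}")
print(f"asymmetric: scale={asym_scale:.5f}, zero_point={asym_zp}, max error={asym_err:.5f}")
```

Because the data never goes below zero, the symmetric scheme leaves the negative half of the INT8 range unused, so its step size is roughly twice that of the asymmetric scheme. In both cases the round-trip error of an in-range value is bounded by half a step.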

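Quantization granularity

The unit at which the quantization parameters (scale and zero point) are set has a large effect on accuracy.

Granularity	Description	Accuracy	Efficiency
Per-tensor	One parameter set for the whole tensor	Low	Highest
Per-channel	One parameter set per output channel	High	High
Per-group	One parameter set per group of N elements	Highest	Medium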

# Per-channel quantization (one scale per output channel)
# Uses compute_scale_zeropoint() and quantize() defined above.
# Conv2D weight: shape = [out_ch, in_ch, kH, kW]
weight = np.random.randn(64, 32, 3, 3)

scales = []
zero_points = []
quantized_weight = np.zeros_like(weight, dtype=np.int8)

for oc in range(weight.shape[0]):
    scale, zp = compute_scale_zeropoint(
        weight[oc], symmetric=True
    )
    scales.append(scale)
    zero_points.append(zp)
    quantized_weight[oc] = quantize(weight[oc], scale, zp)
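
To make the accuracy column of the granularity table concrete, the sketch below compares the round-trip error of the three granularities on one synthetic weight tensor. It is an illustration under assumptions: the per-channel magnitudes (channel_gain), the group size of 32, and the helper fake_quant_symmetric are all hypothetical choices, not values from a real model.

```python
import numpy as np

def fake_quant_symmetric(w, axis=None, q_max=127):
    """Symmetric INT8 round trip; `axis` lists the axes that share one scale."""
    abs_max = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = np.where(abs_max > 0, abs_max / q_max, 1.0)
    return scale * np.clip(np.round(w / scale), -q_max - 1, q_max)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
# Hypothetical Conv2D weight [out_ch, in_ch, kH, kW] whose channels differ in magnitude.
channel_gain = np.logspace(-2, 0, 64).reshape(64, 1, 1, 1)
weight = rng.normal(size=(64, 32, 3, 3)) * channel_gain

# Per-tensor: a single scale for everything.
per_tensor = fake_quant_symmetric(weight, axis=None)
# Per-channel: one scale per output channel (axis 0 keeps its own scale).
per_channel = fake_quant_symmetric(weight, axis=(1, 2, 3))
# Per-group: split each channel's 288 weights into 9 groups of 32.
grouped = weight.reshape(64, -1, 32)
per_group = fake_quant_symmetric(grouped, axis=2).reshape(weight.shape)

for name, w_hat in [("per-tensor", per_tensor),
                    ("per-channel", per_channel),
                    ("per-group", per_group)]:
    print(f"{name:12s} MSE = {mse(weight, w_hat):.3e}")
```

Finer granularity lowers the error because each scale only has to cover the largest value in its own slice, at the cost of storing and applying more scales.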

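Comparison of data types

Data type	Bit width	Value range	Precision	Model size	Speedup
FP32	32 bit	±3.4e38	Baseline	Baseline	1x
FP16	16 bit	±65504	High	1/2	2x
BF16	16 bit	±3.4e38	Lower than FP16 (FP32-like range)	1/2	2x
INT8	8 bit	-128 to 127	Medium	1/4	2-4x
INT4	4 bit	-8 to 7	Fairly low	1/8	4-8x
INT2/1	2/1 bit	Very limited	Low	1/16 to 1/32	Theoretical maximum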
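
The model size column follows directly from the bit width. As a quick arithmetic sketch, assuming a hypothetical model with 7 billion parameters and counting only the raw weight storage:

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16,
                   "INT8": 8, "INT4": 4, "INT2": 2, "INT1": 1}

def weight_storage_bytes(n_params, dtype):
    """Bytes needed to store n_params weights, ignoring scales and metadata."""
    return n_params * BITS_PER_WEIGHT[dtype] / 8

n_params = 7_000_000_000  # hypothetical 7B-parameter model
for dtype, bits in BITS_PER_WEIGHT.items():
    size_gb = weight_storage_bytes(n_params, dtype) / 1e9
    ratio = bits / BITS_PER_WEIGHT["FP32"]
    print(f"{dtype:5s} {size_gb:6.2f} GB  ({ratio:.4f} of FP32)")
```

Real quantized files are somewhat larger than these figures because they also store scales, zero points, and any layers kept at higher precision.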

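How it works

Quantization methods (PTQ variants and QAT)

1. Dynamic quantization (simplest)

Only the weights are quantized ahead of time; activations are quantized dynamically at inference time.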

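import torch

# Load a fully pickled model. Recent PyTorch releases default to
# weights_only=True, so loading a whole model object needs
# weights_only=False (only do this for files you trust).
model = torch.load('model.pth', weights_only=False)
model.eval()

# PyTorch dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},  # layer types to quantize
    dtype=torch.qint8
)
torch.save(quantized_model, 'model_dynamic_int8.pth')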
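
What "quantizing activations dynamically" means can be sketched in plain NumPy. This is a simplified illustration, not PyTorch's actual kernel: it assumes per-tensor symmetric quantization for both operands, and the names quantize_symmetric and dynamic_linear are hypothetical.

```python
import numpy as np

def quantize_symmetric(x, q_max=127):
    """Per-tensor symmetric INT8 quantization; returns (int8 array, scale)."""
    abs_max = float(np.max(np.abs(x)))
    scale = abs_max / q_max if abs_max > 0 else 1.0
    q = np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weight = rng.normal(size=(16, 64)).astype(np.float32)  # hypothetical Linear weight
w_q, w_scale = quantize_symmetric(weight)               # done once, ahead of time

def dynamic_linear(x):
    """Quantize activations at call time, multiply in integers, rescale to float."""
    x_q, x_scale = quantize_symmetric(x)                 # scale depends on this input
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # INT32 accumulation
    return acc.astype(np.float32) * (x_scale * w_scale)

x = rng.normal(size=(4, 64)).astype(np.float32)
reference = x @ weight.T
approx = dynamic_linear(x)
rel_err = np.linalg.norm(approx - reference) / np.linalg.norm(reference)
print(f"relative error vs FP32: {rel_err:.4f}")
```

The weight scale is fixed when the model is converted, while the activation scale is recomputed for every input. That is why no calibration data is needed, and also why each call pays a small extra cost for finding the activation range.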

2. Static quantization (fastest)

Both weights and activations are quantized ahead of time. This requires calibration data.

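import torch
from torch.quantization import prepare, convert

# MyModel, calibration_loader, and get_model_size are assumed to be
# defined elsewhere. In eager mode, MyModel must wrap its forward pass
# with QuantStub/DeQuantStub for the converted model to run.
model = MyModel()
model.eval()

# Quantization configuration
model.qconfig = torch.quantization.get_default_qconfig('x86')

# Prepare for quantization (inserts observers)
model_prepared = prepare(model)

# Calibration (one pass over representative data)
with torch.no_grad():
    for data, _ in calibration_loader:
        model_prepared(data)

# Convert to a quantized model
model_int8 = convert(model_prepared)
print(f"FP32 model size: {get_model_size(model):.1f}MB")
print(f"INT8 model size: {get_model_size(model_int8):.1f}MB")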
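
Conceptually, the calibration pass only records the value range of each activation so that a fixed scale and zero point can be chosen before deployment. The following is a minimal sketch of that idea, assuming a simple running min/max rule; MinMaxObserver here is a hypothetical stand-in and not PyTorch's own observer implementation, which offers more elaborate strategies.

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of every batch it sees (hypothetical helper)."""

    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")

    def observe(self, batch):
        """Update the recorded range with one batch of activations."""
        self.r_min = min(self.r_min, float(np.min(batch)))
        self.r_max = max(self.r_max, float(np.max(batch)))

    def qparams(self, q_min=-128, q_max=127):
        """Affine scale and zero point covering the observed range and 0.0."""
        r_min, r_max = min(self.r_min, 0.0), max(self.r_max, 0.0)
        scale = (r_max - r_min) / (q_max - q_min) or 1.0
        zero_point = int(np.clip(round(q_min - r_min / scale), q_min, q_max))
        return scale, zero_point

rng = np.random.default_rng(0)
observer = MinMaxObserver()
for _ in range(20):  # stand-in for iterating over a calibration loader
    observer.observe(rng.uniform(-1.0, 3.0, size=(8, 32)))

scale, zero_point = observer.qparams()
print(f"observed range: [{observer.r_min:.3f}, {observer.r_max:.3f}]")
print(f"scale={scale:.5f}, zero_point={zero_point}")
```

Because the range is frozen after calibration, the calibration data should be representative of real inputs: values outside the observed range are clipped at inference time.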

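3. Quantization-aware training (QAT; highest accuracy)

Quantization is simulated during training so that the model learns to compensate for quantization error.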

# QAT (Quantization-Aware Training)
# model, train_loader, and criterion are assumed to be defined elsewhere.
import copy
import torch
from torch.quantization import convert

model_qat = copy.deepcopy(model)
model_qat.train()
model_qat.qconfig = torch.quantization.get_default_qat_qconfig('x86')

# Insert fake-quantization nodes
model_qat_prepared = torch.quantization.prepare_qat(model_qat)

# Fine-tuning (a few additional epochs of training)
optimizer = torch.optim.SGD(model_qat_prepared.parameters(), lr=1e-4)
for epoch in range(5):
    for data, labels in train_loader:
        output = model_qat_prepared(data)
        loss = criterion(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Convert to an INT8 model
model_qat_int8 = convert(model_qat_prepared.eval())
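
The fake-quantization nodes used in QAT can be sketched as a forward and a backward function. This is a simplified NumPy illustration, assuming a fixed scale and the straight-through estimator (STE), a common way to pass gradients through the non-differentiable rounding step; the function names are hypothetical and this is not PyTorch's internal implementation.

```python
import numpy as np

def fake_quantize(x, scale, zero_point=0, q_min=-128, q_max=127):
    """Forward pass: quantize then dequantize, staying in floating point."""
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return scale * (q - zero_point)

def fake_quantize_grad(x, grad_out, scale, zero_point=0, q_min=-128, q_max=127):
    """Backward pass with a straight-through estimator (STE).

    round() is treated as the identity, so the gradient passes through
    unchanged where x lands inside the representable range and is zero
    where the value was clipped.
    """
    q_unclipped = np.round(x / scale) + zero_point
    inside = (q_unclipped >= q_min) & (q_unclipped <= q_max)
    return grad_out * inside

scale = 0.1
x = np.array([-20.0, -0.26, 0.0, 0.04, 0.33, 20.0])
y = fake_quantize(x, scale)
grad = fake_quantize_grad(x, np.ones_like(x), scale)
print("fake-quantized:", y)
print("gradient mask :", grad)
```

The forward pass exposes the network to the same rounding and clipping error it will see after conversion, while the backward pass keeps the weights trainable in floating point.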

INT8 quantization for TFLite

import tensorflow as tf
import numpy as np

# Representative dataset for calibration.
# calibration_images is assumed to be an array of preprocessed inputs.
def representative_dataset_gen():
    dataset = tf.data.Dataset.from_tensor_slices(
        calibration_images
    ).batch(1)
    for data in dataset.take(200):
        yield [tf.cast(data, tf.float32)]

# Converter settings
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen

# Full INT8 quantization (including inputs and outputs)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Run the conversion
tflite_model = converter.convert()
with open('model_full_int8.tflite', 'wb') as f:
    f.write(tflite_model)

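Applications and use cases

Inference on microcontrollers (MCUs)

Cortex-M4/M7 microcontrollers with an FPU can run FP32 arithmetic, but CMSIS-NN speeds up INT8 inference with the SMLAD instruction (exposed as the __SMLAD intrinsic), which performs two 16-bit multiply-accumulates at once on INT8 values widened to 16 bits. Cortex-M55 with MVE (Helium) can process up to 16 INT8 values in parallel.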

Inference performance on Cortex-M4 @ 168 MHz (small CNN):
  FP32: 80 ms
  INT8 (CMSIS-NN): 15 ms → about 5.3x faster
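
The dual multiply-accumulate behind this speedup can be emulated in a few lines. The sketch below models the arithmetic of SMLAD (two signed 16-bit products added to an accumulator) in plain Python; it ignores 32-bit overflow and the Q flag, and the helper names are hypothetical rather than CMSIS-NN functions.

```python
def to_int16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack_int16_pair(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word (lo in bits 0-15)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def smlad(x, y, acc):
    """Emulate SMLAD: acc + x.lo * y.lo + x.hi * y.hi (overflow handling omitted)."""
    return (acc
            + to_int16(x) * to_int16(y)
            + to_int16(x >> 16) * to_int16(y >> 16))

def dot_int8(a, b):
    """Dot product of two even-length INT8 sequences, two MACs per smlad() call."""
    acc = 0
    for i in range(0, len(a), 2):
        # INT8 values are sign-extended to 16 bits before being packed.
        acc = smlad(pack_int16_pair(a[i], a[i + 1]),
                    pack_int16_pair(b[i], b[i + 1]), acc)
    return acc

a = [-128, 127, 5, -7]
b = [3, -2, 100, 100]
print(dot_int8(a, b), sum(x * y for x, y in zip(a, b)))
```

Both printed values agree, which shows that packing pairs of operands into 32-bit words halves the number of multiply-accumulate instructions without changing the result.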

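Quantization for NPUs

NPUs are generally specialized for INT8 or INT16 arithmetic, and most cannot run an FP32 model as-is. To use such an NPU, INT8 quantization is required.

4-bit quantization of LLMs (large language models)

Since 2023, 4-bit quantization has spread rapidly as a way to run LLMs on PCs, smartphones, and edge devices.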

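# 4-bit quantization of an LLM with llama.cpp (example)
# $ ./quantize llama-7b-f32.gguf llama-7b-q4_k_m.gguf Q4_K_M
# (newer llama.cpp builds name this binary llama-quantize)
#
# Model size comparison:
# FP32: 28GB → Q4_K_M: 4.1GB (compressed to about 1/7)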
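
The reason a 4-bit file ends up larger than exactly 1/8 of FP32 is the per-group metadata. The sketch below shows a generic group-wise 4-bit scheme in NumPy, assuming groups of 32 weights with one FP16 scale each and two values packed per byte. It is an illustration only and is not the GGUF Q4_K_M layout, which uses a more elaborate block structure.

```python
import numpy as np

def quantize_q4_groups(w, group_size=32):
    """Symmetric 4-bit quantization with one FP16 scale per group of weights."""
    groups = w.reshape(-1, group_size)
    abs_max = np.max(np.abs(groups), axis=1, keepdims=True)
    scales = np.where(abs_max > 0, abs_max / 7.0, 1.0).astype(np.float16)
    q = np.clip(np.round(groups / scales.astype(np.float32)), -8, 7).astype(np.int8)
    # Store each value as an unsigned nibble (q + 8) and pack two per byte.
    nibbles = (q + 8).astype(np.uint8).reshape(-1, 2)
    packed = nibbles[:, 0] | (nibbles[:, 1] << 4)
    return packed, scales

def dequantize_q4_groups(packed, scales, group_size=32):
    """Unpack the nibbles and rescale back to FP32."""
    nibbles = np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1, group_size)
    return (nibbles.astype(np.float32) - 8.0) * scales.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
packed, scales = quantize_q4_groups(w)
w_hat = dequantize_q4_groups(packed, scales).reshape(-1)

bits_per_weight = 8 * (packed.nbytes + scales.nbytes) / w.size
max_err = float(np.abs(w - w_hat).max())
print(f"bits per weight: {bits_per_weight}")
print(f"max abs error  : {max_err:.4f}")
```

With these assumptions each weight costs 4 bits plus 16/32 bits of scale, so the effective width is above 4 bits. Smaller groups improve accuracy but raise that overhead.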

Implementation and development notes

Evaluating quantized accuracy

import numpy as np

def evaluate_accuracy(interpreter, test_dataset):
    """Return top-1 accuracy of a TFLite interpreter on test_dataset."""
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    is_quantized = (
        input_details[0]['dtype'] == np.int8
    )

    correct = 0
    total = 0

    for images, labels in test_dataset:
        for image, label in zip(images, labels):
            if is_quantized:
                # An INT8 model expects quantized input: round, then
                # clip to the INT8 range before casting.
                scale = input_details[0]['quantization'][0]
                zp = input_details[0]['quantization'][1]
                image_q = np.clip(
                    np.round(image / scale + zp), -128, 127
                ).astype(np.int8)
                interpreter.set_tensor(
                    input_details[0]['index'],
                    image_q[np.newaxis]
                )
            else:
                interpreter.set_tensor(
                    input_details[0]['index'],
                    image[np.newaxis]
                )

            interpreter.invoke()
            pred = np.argmax(
                interpreter.get_tensor(output_details[0]['index'])
            )
            if pred == label:
                correct += 1
            total += 1

    return correct / total

# fp32_interpreter, int8_interpreter, and test_dataset are assumed to be
# prepared elsewhere (interpreters with allocate_tensors() already called).
fp32_acc = evaluate_accuracy(fp32_interpreter, test_dataset)
int8_acc = evaluate_accuracy(int8_interpreter, test_dataset)
print(f"FP32 accuracy: {fp32_acc*100:.2f}%")
print(f"INT8 accuracy: {int8_acc*100:.2f}%")
print(f"Accuracy drop: {(fp32_acc - int8_acc)*100:.2f} percentage points")

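Layers that are hard to quantize

Some layers can lose a large amount of accuracy after quantization.

Layer	Quantization difficulty	Mitigation
Softmax	Moderate (skewed output distribution)	Run in FP32
LayerNorm	Moderate	Per-token scales
Attention (QKV products)	Hard (wide value range)	FP16 or specialized quantization
First convolution	Moderate	Per-channel quantization
Residual addition	Easy	Input scales must be matched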
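
The scale matching needed for residual addition can be illustrated as follows. This is a minimal sketch assuming symmetric INT8 tensors and hypothetical scale values; it rescales through floating point for readability, whereas integer-only kernels typically use fixed-point multipliers for the same step.

```python
import numpy as np

def quantize(x, scale, zero_point=0):
    """Quantize real values to INT8 with the given scale and zero point."""
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def quantized_add(qa, scale_a, qb, scale_b, scale_out):
    """Add two symmetric INT8 tensors that use different scales.

    Each operand is converted to real units before the sum, because adding
    the raw integers would mix two different units.
    """
    a_real = scale_a * qa.astype(np.float32)
    b_real = scale_b * qb.astype(np.float32)
    return np.clip(np.round((a_real + b_real) / scale_out), -128, 127).astype(np.int8)

a = np.array([0.50, -1.00, 0.25], dtype=np.float32)  # skip-connection branch
b = np.array([4.00, 2.00, -6.00], dtype=np.float32)  # residual branch
scale_a, scale_b, scale_out = 1.0 / 127, 6.0 / 127, 7.0 / 127  # hypothetical scales

qa, qb = quantize(a, scale_a), quantize(b, scale_b)
q_sum = quantized_add(qa, scale_a, qb, scale_b, scale_out)

print("naive integer add :", (qa.astype(np.int32) + qb).tolist())
print("rescaled add      :", (scale_out * q_sum).round(3).tolist())
print("FP32 reference    :", (a + b).tolist())
```

The naive integer sum is meaningless because one unit of qa and one unit of qb stand for different real values. After rescaling, the result stays within about one output step of the FP32 reference.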

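Comparison with other techniques

Alongside pruning, quantization is one of the two major model compression techniques.

Aspect	Quantization	Pruning	Knowledge distillation
Effect	4-8x compression	2-10x compression	Task-dependent
Implementation difficulty	Low to medium	Medium to high	High
Retraining required	No (PTQ) or yes (QAT)	Yes	Yes (full training)
Hardware support	Greatest benefit on NPUs/DSPs with integer support	General-purpose	Hardware-independent
Accuracy degradation	Small (INT8)	Small to medium	Small (when it works well)

Combined with TensorFlow Lite, ONNX, and inference accelerators, quantization leads directly to practical edge AI. It is essential for getting the most out of an NPU.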
