Open-Source Large Model Grok-1

Basic Parameters:

  • Parameters: 314B
  • Architecture: Mixture of 8 Experts (MoE)
  • Expert Utilization: 2 experts used per token
  • Layers: 64
  • Attention Heads: 48 for queries, 8 for keys/values
  • Embedding Size: 6,144
  • Tokenization: SentencePiece tokenizer with 131,072 tokens
  • Additional Features:
    • Rotary embeddings (RoPE)
    • Supports activation sharding and 8-bit quantization
  • Maximum Sequence Length (context): 8,192 tokens
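
Taken together, these hyperparameters map onto the model configuration that run.py constructs. Below is a minimal sketch assuming the field names of the public xai-org/grok-1 repository's LanguageModelConfig / TransformerConfig; several fields (pad/eos tokens, initialization and output scales, mesh axes) are omitted for brevity, so treat it as illustrative rather than a verbatim copy.

```python
# Sketch of the Grok-1 configuration; field names assume the public
# xai-org/grok-1 repo, values mirror the parameter list above.
from model import LanguageModelConfig, TransformerConfig

grok_1_model = LanguageModelConfig(
    vocab_size=128 * 1024,        # SentencePiece vocabulary: 131,072 tokens
    sequence_len=8192,            # maximum context length
    model=TransformerConfig(
        emb_size=48 * 128,        # embedding size: 6,144
        key_size=128,             # per-head key/value dimension
        num_q_heads=48,           # 48 attention heads for queries
        num_kv_heads=8,           # 8 heads for keys/values
        num_layers=64,            # 64 Transformer layers
        num_experts=8,            # MoE: 8 experts per layer
        num_selected_experts=2,   # top-2 routing: 2 experts per token
        shard_activations=True,   # activation sharding support
    ),
)
```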

Project Architecture:

  • run.py -> runner.py -> checkpoint.py -> model.py
  • run.py (parameter configuration and input/output): params -> TransformerConfig() -> LanguageModelConfig() -> ModelRunner() -> InferenceRunner() -> initialize() -> run() -> sample_from_model() -> output (see the sketch after this list)
  • runner.py (model loading and inference): ModelRunner, InferenceRunner
  • checkpoint.py: weight loading
  • model.py (the core Transformer model): InOutEmbed + Transformer
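
The call chain above condenses to roughly the following. This is a trimmed sketch of run.py's main flow based on the public repo; the checkpoint path is hypothetical, and several required arguments (batch size, pad sizes, mesh configurations) are left out for brevity.

```python
# Condensed sketch of run.py's inference flow (arguments illustrative).
from runner import InferenceRunner, ModelRunner, sample_from_model

CKPT_PATH = "./checkpoints"  # hypothetical local checkpoint directory

inference_runner = InferenceRunner(
    runner=ModelRunner(
        model=grok_1_model,          # the LanguageModelConfig sketched above
        checkpoint_path=CKPT_PATH,   # weights restored via checkpoint.py
    ),
    name="local",
    load=CKPT_PATH,
    tokenizer_path="./tokenizer.model",
)
inference_runner.initialize()        # build the model and load the checkpoint
gen = inference_runner.run()         # generator that serves sampling requests

prompt = "The answer to life the universe and everything is"
print(sample_from_model(gen, prompt, max_len=100, temperature=0.01))
```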

Model Architecture:

  • So far this covers the top-level architecture; the multi-head attention and mixture-of-experts internals will be covered in a future update.
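
Until that follow-up, the sketch below shows the top-level forward pass implied by the InOutEmbed + Transformer split: a shared embedding table maps token ids in and projects hidden states back out to logits, with the decoder layers in between. The structure is an assumption from standard decoder-only designs (with toy sizes so it runs), not a copy of model.py; the layer internals are stubbed out, matching the note above.

```python
# Rough top-level forward-pass sketch (assumed structure, toy sizes;
# Grok-1's real sizes are vocab 131,072, embedding 6,144, 64 layers).
import numpy as np

VOCAB, EMB, LAYERS = 1_000, 64, 4
rng = np.random.default_rng(0)
embed = rng.normal(scale=0.02, size=(VOCAB, EMB)).astype(np.float32)

def decoder_layer(h: np.ndarray) -> np.ndarray:
    # Placeholder for one Transformer layer: attention with 48 query /
    # 8 key-value heads plus a top-2 MoE feed-forward; details deferred.
    return h

def forward(token_ids: np.ndarray) -> np.ndarray:
    h = embed[token_ids]        # InOutEmbed: token ids -> embeddings
    for _ in range(LAYERS):     # stack of decoder layers
        h = decoder_layer(h)
    return h @ embed.T          # InOutEmbed reused to produce output logits

logits = forward(np.array([1, 2, 3]))
print(logits.shape)             # (3, 1000)
```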