Merge branch 'main' into esp32s3_korvo_2_v3

2026-01-14 09:17:20 +08:00 · 2024-10-28 10:53:11 +08:00 · cff159fec1
commit cff159fec1
parent 43b1e86a25 fe05a039a2
30 changed files with 3764 additions and 821 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -8,3 +8,4 @@ sdkconfig.old

 dependencies.lock
 .env
+releases/
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -4,7 +4,7 @@
 # CMakeLists in this exact order for cmake to work correctly
 cmake_minimum_required(VERSION 3.16)

-set(PROJECT_VER "0.3.1")
+set(PROJECT_VER "0.4.1")

 add_compile_options(-Wno-error=format= -Wno-format)

--- a/README.md
+++ b/README.md
@@ -1,9 +1,11 @@
 # Xiaozhi AI Chatbot

-BiliBili video introduction: [Build your AI chat companion with ESP32+SenseVoice+Qwen72B!](https://www.bilibili.com/video/BV11msTenEH3/?share_source=copy_web&vd_source=ee1aafe19d6e60cf22e60a93881faeba)
-
 This is the first hardware project by Xiage (虾哥).

+[Build your AI chat companion with ESP32+SenseVoice+Qwen72B! (bilibili)](https://www.bilibili.com/video/BV11msTenEH3/?share_source=copy_web&vd_source=ee1aafe19d6e60cf22e60a93881faeba)
+
+[Hand-build your AI girlfriend: a beginner's tutorial (bilibili)](https://www.bilibili.com/video/BV1XnmFYLEJN/)
+
 ## Project purpose

 This project is developed on Espressif's ESP-IDF.
@@ -18,16 +20,16 @@ BiliBili 视频介绍 [【ESP32+SenseVoice+Qwen72B打造你的AI聊天伴侣！

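 - Wi-Fi provisioning
 - Wake-up and interruption via the BOOT button
-- Offline voice wake-up (using Espressif's solution)
+- Offline voice wake-up (Espressif solution)
 - Streaming voice conversation (WebSocket protocol)
-- Speech recognition in 5 languages: Mandarin, Cantonese, English, Japanese, and Korean (using SenseVoice)
+- Speech recognition in 5 languages: Mandarin, Cantonese, English, Japanese, and Korean (SenseVoice)
 - Voiceprint recognition (identifies who is calling the AI's name, [3D Speaker project](https://github.com/modelscope/3D-Speaker))
-- LLM-based TTS (Volcano Engine; Alibaba Cloud integration in progress)
+- LLM-based TTS (Volcano Engine and CosyVoice)
 - Configurable prompts and voices (custom characters)
-- Free Qwen2.5 72B and Doubao models (limited by performance and quota; usage may be capped as the number of users grows)
+- Qwen2.5 72B or Doubao API
 - Self-summarization after each conversation turn to build a memory
-- LCD display extension showing signal strength (Chinese subtitles may be shown later)
-- ML307 Cat.1 4G module support (optional)
+- LCD display extension showing signal strength
+- ML307 Cat.1 4G module support

 ## Hardware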

@@ -35,60 +37,29 @@ BiliBili 视频介绍 [【ESP32+SenseVoice+Qwen72B打造你的AI聊天伴侣！

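 [Xiaozhi AI Chatbot Encyclopedia](https://ccnphfhqs21z.feishu.cn/wiki/F5krwD16viZoF0kKkvDcrZNYnhb?from=from_copylink)

-The second-version wiring diagram is shown below:
+The breadboard wiring diagram is shown below:

-![Second-version wiring diagram](docs/wiring2.jpg)
+![Breadboard wiring diagram](docs/wiring2.jpg)

 ## Firmware

 ### Flashing without a development environment

-First-time users are advised not to set up a development environment and to flash the prebuilt firmware instead.
+First-time users are advised not to set up a development environment and to flash the prebuilt firmware instead. The firmware connects to a test server kindly provided by the author. It is currently free to use; please do not use it for commercial purposes.

-Click [here](https://github.com/78/xiaozhi-esp32/releases) to download the latest firmware.
+[Flash the firmware (no IDF development environment)](https://ccnphfhqs21z.feishu.cn/wiki/Zpz4wXBtdimBrLk25WdcXzxcnNS)

-The firmware connects to a test server kindly provided by the author. It is currently free to use; please do not use it for commercial purposes.

-### Setting up the development environment
+### Development environment

 - Cursor or VSCode
 - Install the ESP-IDF extension and select SDK version 5.3 or later
 - Ubuntu works better than Windows: builds are faster and driver problems are avoided

-### Configuring the project and building the firmware

-- Only the ESP32 S3 is currently supported, with at least 8MB of flash and 2MB of PSRAM (note: the default configuration is only compatible with 8MB PSRAM; if you use 2MB PSRAM, you must change the configuration or the PSRAM will not be detected)
-- Set the OTA Version URL to `https://api.tenclass.net/xiaozhi/ota/`
-- Set the WebSocket URL to `wss://api.tenclass.net/xiaozhi/v1/`
-- Set the WebSocket Access Token to `test-token`
-- If your INMP441 and MAX98357 wiring differs from the default configuration, change the GPIO configuration
-- After configuration, build the firmware
+## AI character configuration

+If you already have a Xiaozhi AI chatbot, see the [admin console video tutorial](https://www.bilibili.com/video/BV1jUCUY2EKM/)

-## Configuring Wi-Fi (skip for the 4G version)
-
-After wiring the board as described above and flashing the firmware, power on the device. The RGB LED on the board blinks blue (on some boards the RGB LED switch must be soldered before it lights up), which indicates provisioning mode.
-
-Turn on Wi-Fi on your phone, connect to the device hotspot `Xiaozhi-xxxx`, then open `http://192.168.4.1` in a browser to reach the provisioning page.
-
-Select your router's Wi-Fi, enter the password, and tap Connect. The device restarts automatically after 3 seconds and then connects to the router.
-
-## Testing whether the device connected successfully
-
-After the device connects to the router, the green LED blinks once. Now say "你好，小智" ("Hello, Xiaozhi"). The device first lights the blue LED (connecting to the server), then the green LED, and plays the spoken reply.
-
-If the blue LED does not light up, the microphone has a problem; check the wiring.
-
-If the green LED does not light up, or the blue LED stays on, the device has not connected to the server; check the Wi-Fi connection.
-
-If the device is connected to Wi-Fi but there is no sound, check the wiring.
-
-With firmware later than v0.2.1, you can also press the button connected to GPIO 1 (active low) to run a recording test.
-
-## Configuring the device
-
-If the steps above succeed, the device announces its device ID. Go to the [Xiaozhi test server control panel](https://xiaozhi.tenclass.net/) and add the device.
-
-For detailed usage instructions and notes on the test server, see the [Xiaozhi test server help](https://xiaozhi.tenclass.net/help).
-
+For detailed usage instructions and notes on the test server, see the [Xiaozhi test server help](https://xiaozhi.me/help).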

--- a/docs/wiring2.jpg
+++ b/docs/wiring2.jpg
--- a/main/Application.cc
+++ b/main/Application.cc
@@ -1,4 +1,5 @@
 #include <BuiltinLed.h>
+#include <TcpTransport.h>
 #include <TlsTransport.h>
 #include <Ml307SslTransport.h>
 #include <WifiConfigurationAp.h>
@@ -22,34 +23,33 @@ int answer_flag = 0;
 extern lv_obj_t *label1;
 extern lv_obj_t *label_reply;
 Application::Application()
-    : button_((gpio_num_t)CONFIG_BOOT_BUTTON_GPIO)
+    : boot_button_((gpio_num_t)CONFIG_BOOT_BUTTON_GPIO),
+      volume_up_button_((gpio_num_t)CONFIG_VOLUME_UP_BUTTON_GPIO),
+      volume_down_button_((gpio_num_t)CONFIG_VOLUME_DOWN_BUTTON_GPIO),
+#ifdef CONFIG_USE_DISPLAY
+      display_(CONFIG_DISPLAY_SDA_PIN, CONFIG_DISPLAY_SCL_PIN),
+#endif
 #ifdef CONFIG_USE_ML307
-      ,
      ml307_at_modem_(CONFIG_ML307_TX_PIN, CONFIG_ML307_RX_PIN, 4096),
      http_(ml307_at_modem_),
-      firmware_upgrade_(http_)
 #else
-      ,
      http_(),
+#endif
      firmware_upgrade_(http_)
-#endif
-#ifdef CONFIG_USE_DISPLAY
-      ,
-      display_(CONFIG_DISPLAY_SDA_PIN, CONFIG_DISPLAY_SCL_PIN)
-#endif
 {
    event_group_ = xEventGroupCreate();

-    opus_encoder_.Configure(CONFIG_AUDIO_INPUT_SAMPLE_RATE, 1);
+    opus_encoder_.Configure(16000, 1);
    opus_decoder_ = opus_decoder_create(opus_decode_sample_rate_, 1, NULL);
-    if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE)
-    {
-        opus_resampler_.Configure(opus_decode_sample_rate_, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
+    if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE) {
+        output_resampler_.Configure(opus_decode_sample_rate_, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
+    }
+    if (16000 != CONFIG_AUDIO_INPUT_SAMPLE_RATE) {
+        input_resampler_.Configure(CONFIG_AUDIO_INPUT_SAMPLE_RATE, 16000);
    }

    firmware_upgrade_.SetCheckVersionUrl(CONFIG_OTA_VERSION_URL);
    firmware_upgrade_.SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());
-    firmware_upgrade_.SetPostData(SystemInfo::GetJsonString());
 }

 Application::~Application()
@@ -199,8 +199,29 @@ void Application::Start()
    ml307_at_modem_.ResetConnections();
    ml307_at_modem_.WaitForNetworkReady();

-    ESP_LOGI(TAG, "ML307 IMEI: %s", ml307_at_modem_.GetImei().c_str());
-    ESP_LOGI(TAG, "ML307 ICCID: %s", ml307_at_modem_.GetIccid().c_str());
+    std::string imei = ml307_at_modem_.GetImei();
+    std::string iccid = ml307_at_modem_.GetIccid();
+    ESP_LOGI(TAG, "ML307 IMEI: %s", imei.c_str());
+    ESP_LOGI(TAG, "ML307 ICCID: %s", iccid.c_str());
+
+    // If low power, the material ready event will be triggered by the modem because of a reset
+    ml307_at_modem_.OnMaterialReady([this]() {
+        ESP_LOGI(TAG, "ML307 material ready");
+        Schedule([this]() {
+            SetChatState(kChatStateIdle);
+        });
+    });
+
+    // Set the board type for OTA
+    std::string carrier_name = ml307_at_modem_.GetCarrierName();
+    int csq = ml307_at_modem_.GetCsq();
+    std::string board_json = std::string("{\"type\":\"compact.4g\",");
+    board_json += "\"revision\":\"" + module_name + "\",";
+    board_json += "\"carrier\":\"" + carrier_name + "\",";
+    board_json += "\"csq\":\"" + std::to_string(csq) + "\",";
+    board_json += "\"imei\":\"" + imei + "\",";
+    board_json += "\"iccid\":\"" + iccid + "\"}";
+    firmware_upgrade_.SetBoardJson(board_json);
 #else
    // Try to connect to WiFi, if failed, launch the WiFi configuration AP
    auto &wifi_station = WifiStation::GetInstance();
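
Review note: the `board_json` string in this hunk is built by plain concatenation, so a carrier name (or, in the Wi-Fi branch, an SSID) containing a double quote or backslash would yield invalid JSON. A minimal sketch of one way to guard against that follows. `EscapeJsonString` and `BuildBoardJson` are hypothetical helpers, not part of this commit, and only a subset of the fields is shown.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not in this commit): escape a value so it can be
// placed between double quotes in a JSON document.
std::string EscapeJsonString(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if (c < 0x20) {
                // Remaining control characters use the \u00XX form.
                char buf[7];
                std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                out += buf;
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}

// Hypothetical builder mirroring a subset of the 4G board_json fields above.
// csq is emitted as a quoted string, as in the hunk.
std::string BuildBoardJson(const std::string& carrier_name, int csq) {
    std::string json = "{\"type\":\"compact.4g\",";
    json += "\"carrier\":\"" + EscapeJsonString(carrier_name) + "\",";
    json += "\"csq\":\"" + std::to_string(csq) + "\"}";
    return json;
}
```

Using cJSON, which this file already uses for parsing, to build the object would be another option; the sketch above only keeps the string-concatenation style of the hunk.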
@@ -224,34 +245,59 @@ void Application::Start()
        wifi_ap.Start();
        return;
    }
+
+    // Set the board type for OTA
+    std::string board_json = std::string("{\"type\":\"compact.wifi\",");
+    board_json += "\"ssid\":\"" + wifi_station.GetSsid() + "\",";
+    board_json += "\"rssi\":" + std::to_string(wifi_station.GetRssi()) + ",";
+    board_json += "\"channel\":" + std::to_string(wifi_station.GetChannel()) + ",";
+    board_json += "\"ip\":\"" + wifi_station.GetIpAddress() + "\",";
+    board_json += "\"mac\":\"" + SystemInfo::GetMacAddress() + "\"}";
+    firmware_upgrade_.SetBoardJson(board_json);
 #endif
    label_ask_set_text("网络连接成功");
-    audio_device_.OnInputData([this](const int16_t *data, int size)
-                              {
+    audio_device_.Initialize();
+    audio_device_.OnInputData([this](std::vector<int16_t>&& data) {
+        if (16000 != CONFIG_AUDIO_INPUT_SAMPLE_RATE) {
+            if (audio_device_.input_channels() == 2) {
+                auto left_channel = std::vector<int16_t>(data.size() / 2);
+                auto right_channel = std::vector<int16_t>(data.size() / 2);
+                for (size_t i = 0, j = 0; i < left_channel.size(); ++i, j += 2) {
+                    left_channel[i] = data[j];
+                    right_channel[i] = data[j + 1];
+                }
+                auto resampled_left = std::vector<int16_t>(input_resampler_.GetOutputSamples(left_channel.size()));
+                auto resampled_right = std::vector<int16_t>(input_resampler_.GetOutputSamples(right_channel.size()));
+                input_resampler_.Process(left_channel.data(), left_channel.size(), resampled_left.data());
+                input_resampler_.Process(right_channel.data(), right_channel.size(), resampled_right.data());
+                data.resize(resampled_left.size() + resampled_right.size());
+                for (size_t i = 0, j = 0; i < resampled_left.size(); ++i, j += 2) {
+                    data[j] = resampled_left[i];
+                    data[j + 1] = resampled_right[i];
+                }
+            } else {
+                auto resampled = std::vector<int16_t>(input_resampler_.GetOutputSamples(data.size()));
+                input_resampler_.Process(data.data(), data.size(), resampled.data());
+                data = std::move(resampled);
+            }
+        }
 #ifdef CONFIG_USE_AFE_SR
-                                  if (audio_processor_.IsRunning())
-                                  {
-                                      audio_processor_.Input(data, size);
-                                  }
-                                  if (wake_word_detect_.IsDetectionRunning())
-                                  {
-                                      wake_word_detect_.Feed(data, size);
-                                  }
+        if (audio_processor_.IsRunning()) {
+            audio_processor_.Input(data);
+        }
+        if (wake_word_detect_.IsDetectionRunning()) {
+            wake_word_detect_.Feed(data);
+        }
 #else
-                                  std::vector<int16_t> pcm(data, data + size);
-                                  Schedule([this, pcm = std::move(pcm)]()
-                                           {
+        Schedule([this, data = std::move(data)]() {
            if (chat_state_ == kChatStateListening) {
                std::lock_guard<std::mutex> lock(mutex_);
-                audio_encode_queue_.emplace_back(std::move(pcm));
+                audio_encode_queue_.emplace_back(std::move(data));
                cv_.notify_all();
            } });
 #endif
                              });

-    // Initialize the audio device
-    audio_device_.Start(CONFIG_AUDIO_INPUT_SAMPLE_RATE, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
-
    // OPUS encoder / decoder use a lot of stack memory
    const size_t opus_stack_size = 4096 * 8;
    audio_encode_task_stack_ = (StackType_t *)heap_caps_malloc(opus_stack_size, MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT);
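
Review note: the new input callback splits interleaved stereo into two mono buffers, resamples each, and interleaves the result again. The sketch below isolates that pattern so it can be tested off-device. `NaiveDecimateBy2` is only a stand-in for `OpusResampler` (it drops every second sample without filtering), and all function names are hypothetical. If `OpusResampler` keeps filter state between calls (an assumption, since its implementation is not shown in this diff), feeding both channels through the single `input_resampler_` would mix their histories, and one resampler per channel may be safer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split interleaved stereo (L0 R0 L1 R1 ...) into two mono buffers.
void Deinterleave(const std::vector<int16_t>& stereo,
                  std::vector<int16_t>& left,
                  std::vector<int16_t>& right) {
    assert(stereo.size() % 2 == 0);
    left.resize(stereo.size() / 2);
    right.resize(stereo.size() / 2);
    for (size_t i = 0; i < left.size(); ++i) {
        left[i] = stereo[2 * i];
        right[i] = stereo[2 * i + 1];
    }
}

// Merge two equally sized mono buffers back into interleaved stereo.
std::vector<int16_t> Interleave(const std::vector<int16_t>& left,
                                const std::vector<int16_t>& right) {
    assert(left.size() == right.size());
    std::vector<int16_t> stereo(left.size() * 2);
    for (size_t i = 0; i < left.size(); ++i) {
        stereo[2 * i] = left[i];
        stereo[2 * i + 1] = right[i];
    }
    return stereo;
}

// Stand-in for a real resampler: keeps every second sample (2:1 decimation
// with no anti-aliasing filter). For illustration only.
std::vector<int16_t> NaiveDecimateBy2(const std::vector<int16_t>& mono) {
    std::vector<int16_t> out((mono.size() + 1) / 2);
    for (size_t i = 0; i < out.size(); ++i) {
        out[i] = mono[2 * i];
    }
    return out;
}

// Per-channel processing: deinterleave, process each channel, interleave.
std::vector<int16_t> ResampleStereo(const std::vector<int16_t>& stereo) {
    std::vector<int16_t> left, right;
    Deinterleave(stereo, left, right);
    return Interleave(NaiveDecimateBy2(left), NaiveDecimateBy2(right));
}
```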
@@ -265,12 +311,13 @@ void Application::Start()
                {
        Application* app = (Application*)arg;
        app->AudioPlayTask();
-        vTaskDelete(NULL); }, "play_audio", 4096 * 2, this, 5, NULL);
+        vTaskDelete(NULL);
+    }, "play_audio", 4096 * 4, this, 4, NULL);

 #ifdef CONFIG_USE_AFE_SR
-    wake_word_detect_.OnVadStateChange([this](bool speaking)
-                                       { Schedule([this, speaking]()
-                                                  {
+    wake_word_detect_.Initialize(audio_device_.input_channels(), audio_device_.input_reference());
+    wake_word_detect_.OnVadStateChange([this](bool speaking) {
+        Schedule([this, speaking]() {
            auto& builtin_led = BuiltinLed::GetInstance();
            if (chat_state_ == kChatStateListening) {
                if (speaking) {
@@ -314,9 +361,9 @@ void Application::Start()
            wake_word_detect_.StartDetection(); }); });
    wake_word_detect_.StartDetection();

-    audio_processor_.OnOutput([this](std::vector<int16_t> &&data)
-                              { Schedule([this, data = std::move(data)]()
-                                         {
+    audio_processor_.Initialize(audio_device_.input_channels(), audio_device_.input_reference());
+    audio_processor_.OnOutput([this](std::vector<int16_t>&& data) {
+        Schedule([this, data = std::move(data)]() {
            if (chat_state_ == kChatStateListening) {
                std::lock_guard<std::mutex> lock(mutex_);
                audio_encode_queue_.emplace_back(std::move(data));
@@ -328,9 +375,8 @@ void Application::Start()
    builtin_led.SetGreen();
    builtin_led.BlinkOnce();

-    button_.OnClick([this]()
-                    { Schedule([this]()
-                               {
+    boot_button_.OnClick([this]() {
+        Schedule([this]() {
            if (chat_state_ == kChatStateIdle) {
                SetChatState(kChatStateConnecting);
                StartWebSocketClient();
@@ -354,8 +400,51 @@ void Application::Start()
                }
            } }); });

-    xTaskCreate([](void *arg)
-                {
+    volume_up_button_.OnClick([this]() {
+        Schedule([this]() {
+            auto volume = audio_device_.output_volume() + 10;
+            if (volume > 100) {
+                volume = 100;
+            }
+            audio_device_.SetOutputVolume(volume);
+#ifdef CONFIG_USE_DISPLAY
+            display_.ShowNotification("Volume\n" + std::to_string(volume));
+#endif
+        });
+    });
+
+    volume_up_button_.OnLongPress([this]() {
+        Schedule([this]() {
+            audio_device_.SetOutputVolume(100);
+#ifdef CONFIG_USE_DISPLAY
+            display_.ShowNotification("Volume\n100");
+#endif
+        });
+    });
+
+    volume_down_button_.OnClick([this]() {
+        Schedule([this]() {
+            auto volume = audio_device_.output_volume() - 10;
+            if (volume < 0) {
+                volume = 0;
+            }
+            audio_device_.SetOutputVolume(volume);
+#ifdef CONFIG_USE_DISPLAY
+            display_.ShowNotification("Volume\n" + std::to_string(volume));
+#endif
+        });
+    });
+
+    volume_down_button_.OnLongPress([this]() {
+        Schedule([this]() {
+            audio_device_.SetOutputVolume(0);
+#ifdef CONFIG_USE_DISPLAY
+            display_.ShowNotification("Volume\n0");
+#endif
+        });
+    });
+
+    xTaskCreate([](void* arg) {
        Application* app = (Application*)arg;
        app->MainLoop();
        vTaskDelete(NULL); }, "main_loop", 4096 * 2, this, 5, NULL);
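
Review note: the four volume handlers added in this hunk repeat the same "step, then clamp to 0-100" logic. A small helper could express it once. `StepVolume` is a hypothetical name, and the 0-100 range is taken from the handlers above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: apply a signed step to a 0-100 volume and clamp
// the result to the same range.
int StepVolume(int current, int delta) {
    assert(current >= 0 && current <= 100);
    return std::max(0, std::min(100, current + delta));
}
```

With such a helper, a click handler would reduce to something like `audio_device_.SetOutputVolume(StepVolume(audio_device_.output_volume(), 10))`, assuming `output_volume()` returns an `int` as the handlers above suggest.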
@@ -482,11 +571,13 @@ BinaryProtocol *Application::AllocateBinaryProtocol(const uint8_t *payload, size
 void Application::AudioEncodeTask()
 {
    ESP_LOGI(TAG, "Audio encode task started");
-    while (true)
-    {
+    const int max_audio_play_queue_size_ = 2;
+
+    while (true) {
        std::unique_lock<std::mutex> lock(mutex_);
-        cv_.wait(lock, [this]()
-                 { return !audio_encode_queue_.empty() || !audio_decode_queue_.empty(); });
+        cv_.wait(lock, [this]() {
+            return !audio_encode_queue_.empty() || (!audio_decode_queue_.empty() && audio_play_queue_.size() < max_audio_play_queue_size_);
+        });

        if (!audio_encode_queue_.empty())
        {
@@ -500,7 +591,9 @@ void Application::AudioEncodeTask()
                auto protocol = AllocateBinaryProtocol(opus, opus_size);
                Schedule([this, protocol, opus_size]() {
                    if (ws_client_ && ws_client_->IsConnected()) {
-                        ws_client_->Send(protocol, sizeof(BinaryProtocol) + opus_size, true);
+                        if (!ws_client_->Send(protocol, sizeof(BinaryProtocol) + opus_size, true)) {
+                            ESP_LOGE(TAG, "Failed to send audio data");
+                        }
                    }
                    heap_caps_free(protocol);
                }); });
@@ -522,11 +615,10 @@ void Application::AudioEncodeTask()
                continue;
            }

-            if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE)
-            {
-                int target_size = opus_resampler_.GetOutputSamples(frame_size);
+            if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE) {
+                int target_size = output_resampler_.GetOutputSamples(frame_size);
                std::vector<int16_t> resampled(target_size);
-                opus_resampler_.Process(packet->pcm.data(), frame_size, resampled.data());
+                output_resampler_.Process(packet->pcm.data(), frame_size, resampled.data());
                packet->pcm = std::move(resampled);
            }

@@ -551,9 +643,7 @@ void Application::HandleAudioPacket(AudioPacket *packet)
        // This will block until the audio device has finished playing the audio
        audio_device_.OutputData(packet->pcm);

-        if (break_speaking_)
-        {
-            break_speaking_ = false;
+        if (break_speaking_) {
            skip_to_end_ = true;

            // Play a silence and skip to the end
@@ -565,13 +655,16 @@ void Application::HandleAudioPacket(AudioPacket *packet)
        break;
    }
    case kAudioPacketTypeStart:
-        Schedule([this]()
-                 { SetChatState(kChatStateSpeaking); });
+        break_speaking_ = false;
+        skip_to_end_ = false;
+        Schedule([this]() {
+            SetChatState(kChatStateSpeaking);
+        });
        break;
    case kAudioPacketTypeStop:
-        skip_to_end_ = false;
-        Schedule([this]()
-                 { SetChatState(kChatStateListening); });
+        Schedule([this]() {
+            SetChatState(kChatStateListening);
+        });
        break;
    case kAudioPacketTypeSentenceStart:
        ESP_LOGI(TAG, "<< %s", packet->text.c_str());
@@ -606,6 +699,7 @@ void Application::AudioPlayTask()
                 { return !audio_play_queue_.empty(); });
        auto packet = std::move(audio_play_queue_.front());
        audio_play_queue_.pop_front();
+        cv_.notify_all();
        lock.unlock();

        HandleAudioPacket(packet);
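
Review note: the `cv_.notify_all()` added here pairs with the new wait predicate in `AudioEncodeTask()`, which stops decoding once `audio_play_queue_` holds `max_audio_play_queue_size_` packets. Together they form a back-pressure loop: the decoder waits while the play queue is full, and the player wakes it after each pop. A minimal, self-contained sketch of that pattern follows. `BoundedQueue` is a hypothetical class and not how this commit structures the code, which shares one mutex and condition variable across several queues.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical bounded queue: the producer waits while the queue is full,
// and the consumer wakes it after every pop.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks until there is room, then enqueues the item.
    void Push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this]() { return queue_.size() < capacity_; });
        queue_.push_back(std::move(item));
        cv_.notify_all();  // wake a consumer waiting for data
    }

    // Non-blocking variant: returns false when the queue is full.
    bool TryPush(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() >= capacity_) {
            return false;
        }
        queue_.push_back(std::move(item));
        cv_.notify_all();
        return true;
    }

    // Blocks until an item is available, then dequeues it.
    T Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this]() { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop_front();
        cv_.notify_all();  // wake a producer waiting for free space
        return item;
    }

private:
    const size_t capacity_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> queue_;
};
```

Without the notification after the pop, a producer blocked on a full queue would only wake when some unrelated event signalled the condition variable, which is the stall this hunk appears to address.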
@@ -625,7 +719,7 @@ void Application::SetDecodeSampleRate(int sample_rate)
    if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE)
    {
        ESP_LOGI(TAG, "Resampling audio from %d to %d", opus_decode_sample_rate_, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
-        opus_resampler_.Configure(opus_decode_sample_rate_, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
+        output_resampler_.Configure(opus_decode_sample_rate_, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
    }
 }

@@ -637,15 +731,20 @@ void Application::StartWebSocketClient()
        delete ws_client_;
    }

+    std::string url = CONFIG_WEBSOCKET_URL;
    std::string token = "Bearer " + std::string(CONFIG_WEBSOCKET_ACCESS_TOKEN);
 #ifdef CONFIG_USE_ML307
    ws_client_ = new WebSocket(new Ml307SslTransport(ml307_at_modem_, 0));
 #else
-    ws_client_ = new WebSocket(new TlsTransport());
+    if (url.find("wss://") == 0) {
+        ws_client_ = new WebSocket(new TlsTransport());
+    } else {
+        ws_client_ = new WebSocket(new TcpTransport());
+    }
 #endif
    ws_client_->SetHeader("Authorization", token.c_str());
-    ws_client_->SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());
    ws_client_->SetHeader("Protocol-Version", std::to_string(PROTOCOL_VERSION).c_str());
+    ws_client_->SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());

    ws_client_->OnConnected([this]()
                            {
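
Review note: this hunk picks the transport from the URL scheme with `url.find("wss://") == 0`, and anything that is not `wss://` falls through to `TcpTransport`. The sketch below shows a slightly stricter variant that anchors the prefix test at position 0 and reports unknown schemes explicitly. `StartsWith`, `TransportKind`, and `SelectTransport` are hypothetical names, and rejecting unknown schemes is an assumption about the desired behavior, not something this commit does.

```cpp
#include <string>

// True when `s` begins with `prefix`. rfind(prefix, 0) only tests
// position 0, whereas find() may scan the whole string before failing.
bool StartsWith(const std::string& s, const std::string& prefix) {
    return s.rfind(prefix, 0) == 0;
}

// Hypothetical result type standing in for the TlsTransport/TcpTransport choice.
enum class TransportKind { kTls, kTcp, kUnsupported };

// Map a WebSocket URL to a transport kind based on its scheme.
TransportKind SelectTransport(const std::string& url) {
    if (StartsWith(url, "wss://")) {
        return TransportKind::kTls;
    }
    if (StartsWith(url, "ws://")) {
        return TransportKind::kTcp;
    }
    return TransportKind::kUnsupported;
}
```

The caller could then log an error and return early for `kUnsupported` instead of attempting a plain TCP connection to a misconfigured URL.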
@@ -656,7 +755,7 @@ void Application::StartWebSocketClient()
        std::string message = "{";
        message += "\"type\":\"hello\",";
        message += "\"audio_params\":{";
-        message += "\"format\":\"opus\", \"sample_rate\":" + std::to_string(CONFIG_AUDIO_INPUT_SAMPLE_RATE) + ", \"channels\":1";
+        message += "\"format\":\"opus\", \"sample_rate\":16000, \"channels\":1";
        message += "}}";
        ws_client_->Send(message); });

@@ -689,6 +788,10 @@ void Application::StartWebSocketClient()
                        if (sample_rate != NULL) {
                            SetDecodeSampleRate(sample_rate->valueint);
                        }
+
+                        // If the device is speaking, we need to break the speaking
+                        break_speaking_ = true;
+                        skip_to_end_ = true;
                    } else if (strcmp(state->valuestring, "stop") == 0) {
                        packet->type = kAudioPacketTypeStop;
                    } else if (strcmp(state->valuestring, "sentence_end") == 0) {
@@ -711,7 +814,16 @@ memset(minimax_content, 0, sizeof(minimax_content));
        ESP_LOGI(TAG, "minimax_content: %s", minimax_content);
        label_ask_set_text(minimax_content);
                    }
+                } else if (strcmp(type->valuestring, "llm") == 0) {
+                    auto emotion = cJSON_GetObjectItem(root, "emotion");
+                    if (emotion != NULL) {
+                        ESP_LOGD(TAG, "EMOTION: %s", emotion->valuestring);
+                    }
+                } else {
+                    ESP_LOGW(TAG, "Unknown message type: %s", type->valuestring);
                }
+            } else {
+                ESP_LOGE(TAG, "Missing message type, data: %s", data);
            }
            cJSON_Delete(root);
        } });
@@ -731,8 +843,7 @@ memset(minimax_content, 0, sizeof(minimax_content));
            SetChatState(kChatStateIdle);
        }); });

-    if (!ws_client_->Connect(CONFIG_WEBSOCKET_URL))
-    {
+    if (!ws_client_->Connect(url.c_str())) {
        ESP_LOGE(TAG, "Failed to connect to websocket server");
        return;
    }
--- a/main/Application.h
+++ b/main/Application.h
@@ -1,7 +1,6 @@
 #ifndef _APPLICATION_H_
 #define _APPLICATION_H_

-#include "AudioDevice.h"
 #include <OpusEncoder.h>
 #include <OpusResampler.h>
 #include <WebSocket.h>
@@ -17,6 +16,7 @@
 #include <list>
 #include <condition_variable>

+#include "BoxAudioDevice.h"
 #include "Display.h"
 #include "FirmwareUpgrade.h"

@@ -85,8 +85,17 @@ private:
    Application();
    ~Application();

-    Button button_;
+    Button boot_button_;
+    Button volume_up_button_;
+    Button volume_down_button_;
+#ifdef CONFIG_AUDIO_CODEC_ES8311_ES7210
+    BoxAudioDevice audio_device_;
+#else
    AudioDevice audio_device_;
+#endif
+#ifdef CONFIG_USE_DISPLAY
+    Display display_;
+#endif
 #ifdef CONFIG_USE_AFE_SR
    WakeWordDetect wake_word_detect_;
    AudioProcessor audio_processor_;
@@ -98,9 +107,6 @@ private:
    EspHttp http_;
 #endif
    FirmwareUpgrade firmware_upgrade_;
-#ifdef CONFIG_USE_DISPLAY
-    Display display_;
-#endif
    std::mutex mutex_;
    std::condition_variable_any cv_;
    std::list<std::function<void()>> main_tasks_;
@@ -123,7 +129,8 @@ private:

    int opus_duration_ms_ = 60;
    int opus_decode_sample_rate_ = CONFIG_AUDIO_OUTPUT_SAMPLE_RATE;
-    OpusResampler opus_resampler_;
+    OpusResampler input_resampler_;
+    OpusResampler output_resampler_;

    TaskHandle_t check_new_version_task_ = nullptr;
    StaticTask_t check_new_version_task_buffer_;
--- a/main/AudioDevice.cc
+++ b/main/AudioDevice.cc
@@ -1,155 +1,12 @@
 #include "AudioDevice.h"
 #include <esp_log.h>
 #include <cstring>
-#include "driver/gpio.h"
-#include "driver/i2s_std.h"
-#include "esp_system.h"
-#include "esp_check.h"
-#include "es8311.h"
-#include "driver/i2c.h"
-#include "es7210.h"
+#include <cmath>
 #define TAG "AudioDevice"

-
-
-/* Example configurations */
-#define EXAMPLE_RECV_BUF_SIZE (2400)
-#define EXAMPLE_SAMPLE_RATE (16000)
-#define EXAMPLE_MCLK_MULTIPLE (384) // If not using 24-bit data width, 256 should be enough
-#define EXAMPLE_MCLK_FREQ_HZ (EXAMPLE_SAMPLE_RATE * EXAMPLE_MCLK_MULTIPLE)
-#define EXAMPLE_VOICE_VOLUME 70
-
-
-#define ES7210_I2C_ADDR             (0x40)
-#define ES7210_SAMPLE_RATE          (48000)
-#define ES7210_I2S_FORMAT           ES7210_I2S_FMT_I2S
-#define ES7210_MCLK_MULTIPLE        (256)
-#define ES7210_BIT_WIDTH            ES7210_I2S_BITS_16B
-#define ES7210_MIC_BIAS             ES7210_MIC_BIAS_2V87
-#define ES7210_MIC_GAIN             ES7210_MIC_GAIN_9DB
-#define ES7210_ADC_VOLUME           (0)
-
-
-/* I2C port and GPIOs */
-#define I2C_NUM I2C_NUM_0
-
-#define I2C_SCL_IO (GPIO_NUM_18)
-#define I2C_SDA_IO (GPIO_NUM_17)
-
-/* I2S port and GPIOs */
-#define I2S_NUM I2S_NUM_0
-#define I2S_MCK_IO (GPIO_NUM_16)
-#define I2S_BCK_IO (GPIO_NUM_9)
-#define I2S_WS_IO (GPIO_NUM_45)
-#define I2S_DO_IO (GPIO_NUM_8)
-#define I2S_DI_IO (GPIO_NUM_10)
-static i2s_chan_handle_t tx_handle = NULL;
-static i2s_chan_handle_t rx_handle = NULL;
-static es7210_dev_handle_t es7210_handle = NULL;
-
-static void es7210_init(bool is_tdm)
-{
-    /* Create ES7210 device handle */
-    es7210_i2c_config_t es7210_i2c_conf = {
-        .i2c_port = I2C_NUM,
-        .i2c_addr = ES7210_I2C_ADDR
-    };
-    es7210_new_codec(&es7210_i2c_conf, &es7210_handle);
-
-    ESP_LOGI(TAG, "Configure ES7210 codec parameters");
-    es7210_codec_config_t codec_conf = {
-        .sample_rate_hz = ES7210_SAMPLE_RATE,
-        .mclk_ratio = ES7210_MCLK_MULTIPLE,
-        .i2s_format = ES7210_I2S_FORMAT,
-        .bit_width = ES7210_BIT_WIDTH,
-        .mic_bias = ES7210_MIC_BIAS,
-        .mic_gain = ES7210_MIC_GAIN,
-        .flags = {
-            .tdm_enable = 1}};
-    es7210_config_codec(es7210_handle, &codec_conf);
-    es7210_config_volume(es7210_handle, ES7210_ADC_VOLUME);
-}
-
-static esp_err_t es8311_codec_init(void)
-{
-    /* Initialize I2C peripheral */
-    // const i2c_config_t es_i2c_cfg = {
-    //     .mode = I2C_MODE_MASTER,
-    //     .sda_io_num = I2C_SDA_IO,
-    //     .scl_io_num = I2C_SCL_IO,
-    //     .sda_pullup_en = GPIO_PULLUP_ENABLE,
-    //     .scl_pullup_en = GPIO_PULLUP_ENABLE,
-    //     .master = {
-    //             .clk_speed = 400000,
-    //         }
-    // };
-    // ESP_RETURN_ON_ERROR(i2c_param_config(I2C_NUM, &es_i2c_cfg), TAG, "config i2c failed");
-    // ESP_RETURN_ON_ERROR(i2c_driver_install(I2C_NUM, I2C_MODE_MASTER, 0, 0, 0), TAG, "install i2c driver failed");
-
-    /* Initialize es8311 codec */
-    es8311_handle_t es_handle = es8311_create(I2C_NUM, ES8311_ADDRRES_0);
-    ESP_RETURN_ON_FALSE(es_handle, ESP_FAIL, TAG, "es8311 create failed");
-    const es8311_clock_config_t es_clk = {
-        .mclk_inverted = false,
-        .sclk_inverted = false,
-        .mclk_from_mclk_pin = true,
-        .mclk_frequency = EXAMPLE_MCLK_FREQ_HZ,
-        .sample_frequency = EXAMPLE_SAMPLE_RATE};
-
-    ESP_ERROR_CHECK(es8311_init(es_handle, &es_clk, ES8311_RESOLUTION_16, ES8311_RESOLUTION_16));
-    ESP_RETURN_ON_ERROR(es8311_sample_frequency_config(es_handle, EXAMPLE_SAMPLE_RATE * EXAMPLE_MCLK_MULTIPLE, EXAMPLE_SAMPLE_RATE), TAG, "set es8311 sample frequency failed");
-    ESP_RETURN_ON_ERROR(es8311_voice_volume_set(es_handle, EXAMPLE_VOICE_VOLUME, NULL), TAG, "set es8311 volume failed");
-    ESP_RETURN_ON_ERROR(es8311_microphone_config(es_handle, false), TAG, "set es8311 microphone failed");
-#if CONFIG_EXAMPLE_MODE_ECHO
-    ESP_RETURN_ON_ERROR(es8311_microphone_gain_set(es_handle, EXAMPLE_MIC_GAIN), TAG, "set es8311 microphone gain failed");
-#endif
-    return ESP_OK;
-}
-
-static esp_err_t i2s_driver_init(void)
-{
-    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(I2S_NUM, I2S_ROLE_MASTER);
-    chan_cfg.auto_clear = true; // Auto clear the legacy data in the DMA buffer
-    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle, &rx_handle));
-    i2s_std_config_t std_cfg = {
-        .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(EXAMPLE_SAMPLE_RATE),
-        .slot_cfg = {
-            .data_bit_width = I2S_DATA_BIT_WIDTH_32BIT,
-            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
-            .slot_mode = I2S_SLOT_MODE_MONO,
-            .slot_mask = I2S_STD_SLOT_LEFT,
-            .ws_width = I2S_DATA_BIT_WIDTH_32BIT,
-            .ws_pol = false,
-            .bit_shift = true,
-            .left_align = true,
-            .big_endian = false,
-            .bit_order_lsb = false
-        },
-        .gpio_cfg = {
-            .mclk = I2S_MCK_IO,
-            .bclk = I2S_BCK_IO,
-            .ws = I2S_WS_IO,
-            .dout = I2S_DO_IO,
-            .din = I2S_DI_IO,
-            .invert_flags = {
-                .mclk_inv = false,
-                .bclk_inv = false,
-                .ws_inv = false,
-            },
-        },
-    };
-    std_cfg.clk_cfg.mclk_multiple = (i2s_mclk_multiple_t)EXAMPLE_MCLK_MULTIPLE;
-
-    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle, &std_cfg));
-    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle, &std_cfg));
-    ESP_ERROR_CHECK(i2s_channel_enable(tx_handle));
-    ESP_ERROR_CHECK(i2s_channel_enable(rx_handle));
-
-    return ESP_OK;
-}
-
-
-AudioDevice::AudioDevice() {
+AudioDevice::AudioDevice()
+    : input_sample_rate_(CONFIG_AUDIO_INPUT_SAMPLE_RATE),
+      output_sample_rate_(CONFIG_AUDIO_OUTPUT_SAMPLE_RATE) {
 }

 AudioDevice::~AudioDevice() {
@@ -164,51 +21,16 @@ AudioDevice::~AudioDevice() {
    }
 }

-void AudioDevice::Start(int input_sample_rate, int output_sample_rate) {
-    input_sample_rate_ = input_sample_rate;
-    output_sample_rate_ = output_sample_rate;
-
-// #ifdef CONFIG_AUDIO_DEVICE_I2S_SIMPLEX
-//         CreateSimplexChannels();
-// #else
-//         CreateDuplexChannels();
-// #endif
-
-//     ESP_ERROR_CHECK(i2s_channel_enable(tx_handle_));
-//     ESP_ERROR_CHECK(i2s_channel_enable(rx_handle_));
-    printf("i2s es8311 codec example start\n-----------------------------\n");
-    /* Initialize i2s peripheral */
-    if (i2s_driver_init() != ESP_OK)
-    {
-        ESP_LOGE(TAG, "i2s driver init failed");
-        abort();
-    }
-    else
-    {
-        ESP_LOGI(TAG, "i2s driver init success");
-    }
-    /* Initialize i2c peripheral and config es8311 codec by i2c */
-    if (es8311_codec_init() != ESP_OK)
-    {
-        ESP_LOGE(TAG, "es8311 codec init failed");
-        abort();
-    }
-    else
-    {
-        ESP_LOGI(TAG, "es8311 codec init success");
-    }
-    es7210_init(true);
-    esp_rom_gpio_pad_select_gpio(GPIO_NUM_48);
-    gpio_set_direction(GPIO_NUM_48, GPIO_MODE_OUTPUT);
-    gpio_set_level(GPIO_NUM_48, 1); // drive the pin high
-
-    xTaskCreate([](void* arg) {
-        auto audio_device = (AudioDevice*)arg;
-        audio_device->InputTask();
-    }, "audio_input", 1024*10, this, 20, &audio_input_task_);
+void AudioDevice::Initialize() {
+#ifdef CONFIG_AUDIO_I2S_METHOD_SIMPLEX
+    CreateSimplexChannels();
+#else
+    CreateDuplexChannels();
+#endif
 }

 void AudioDevice::CreateDuplexChannels() {
+#ifdef CONFIG_AUDIO_I2S_METHOD_DUPLEX
    duplex_ = true;

    i2s_chan_config_t chan_cfg = {
@@ -243,10 +65,10 @@ void AudioDevice::CreateDuplexChannels() {
        },
        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,
-            .bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_BCLK,
-            .ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_WS,
-            .dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_DOUT,
-            .din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_DIN,
+            .bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_BCLK,
+            .ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_LRCK,
+            .dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DOUT,
+            .din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DIN,
            .invert_flags = {
                .mclk_inv = false,
                .bclk_inv = false,
@@ -256,11 +78,14 @@ void AudioDevice::CreateDuplexChannels() {
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_enable(tx_handle_));
+    ESP_ERROR_CHECK(i2s_channel_enable(rx_handle_));
    ESP_LOGI(TAG, "Duplex channels created");
+#endif
 }

-#ifdef CONFIG_AUDIO_DEVICE_I2S_SIMPLEX
 void AudioDevice::CreateSimplexChannels() {
+#ifdef CONFIG_AUDIO_I2S_METHOD_SIMPLEX
    // Create a new channel for speaker
    i2s_chan_config_t chan_cfg = {
        .id = I2S_NUM_0,
@ -295,7 +120,7 @@ void AudioDevice::CreateSimplexChannels() {
        .gpio_cfg = {
            .mclk = I2S_GPIO_UNUSED,
            .bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_BCLK,
-            .ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_WS,
+            .ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_LRCK,
            .dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_DOUT,
            .din = I2S_GPIO_UNUSED,
            .invert_flags = {
@ -311,24 +136,38 @@ void AudioDevice::CreateSimplexChannels() {
    chan_cfg.id = I2S_NUM_1;
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, nullptr, &rx_handle_));
    std_cfg.clk_cfg.sample_rate_hz = (uint32_t)input_sample_rate_;
-    std_cfg.gpio_cfg.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_BCLK;
+    std_cfg.gpio_cfg.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_SCK;
    std_cfg.gpio_cfg.ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_WS;
    std_cfg.gpio_cfg.dout = I2S_GPIO_UNUSED;
    std_cfg.gpio_cfg.din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_DIN;
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
-    ESP_LOGI(TAG, "Simplex channels created");
-}
-#endif

-void AudioDevice::Write(const int16_t* data, int samples) {
+    ESP_ERROR_CHECK(i2s_channel_enable(tx_handle_));
+    ESP_ERROR_CHECK(i2s_channel_enable(rx_handle_));
+    ESP_LOGI(TAG, "Simplex channels created");
+#endif
+}
+
+int AudioDevice::Write(const int16_t* data, int samples) {
    int32_t buffer[samples];
+
+    // output_volume_: 0-100
+    // volume_factor_: 0-65536
+    int32_t volume_factor = pow(double(output_volume_) / 100.0, 2) * 65536;
    for (int i = 0; i < samples; i++) {
-        buffer[i] = int32_t(data[i]) << 15;
+        int64_t temp = int64_t(data[i]) * volume_factor; // multiply in int64_t to avoid overflow
+        if (temp > INT32_MAX) {
+            buffer[i] = INT32_MAX;
+        } else if (temp < INT32_MIN) {
+            buffer[i] = INT32_MIN;
+        } else {
+            buffer[i] = static_cast<int32_t>(temp);
+        }
    }

    size_t bytes_written;
-    ESP_ERROR_CHECK(i2s_channel_write(tx_handle, buffer, samples * sizeof(int32_t), &bytes_written, portMAX_DELAY));
-
+    ESP_ERROR_CHECK(i2s_channel_write(tx_handle_, buffer, samples * sizeof(int32_t), &bytes_written, portMAX_DELAY));
+    return bytes_written / sizeof(int32_t);
 }

 int AudioDevice::Read(int16_t* dest, int samples) {
@ -348,8 +187,16 @@ int AudioDevice::Read(int16_t* dest, int samples) {
    return samples;
 }

-void AudioDevice::OnInputData(std::function<void(const int16_t*, int)> callback) {
+void AudioDevice::OnInputData(std::function<void(std::vector<int16_t>&& data)> callback) {
    on_input_data_ = callback;
+
+    // Create the audio input task
+    if (audio_input_task_ == nullptr) {
+        xTaskCreate([](void* arg) {
+            auto audio_device = (AudioDevice*)arg;
+            audio_device->InputTask();
+        }, "audio_input", 4096 * 2, this, 3, &audio_input_task_);
+    }
 }

 void AudioDevice::OutputData(std::vector<int16_t>& data) {
@ -357,15 +204,20 @@ void AudioDevice::OutputData(std::vector<int16_t>& data) {
 }

 void AudioDevice::InputTask() {
-    int duration = 20;
-    int input_frame_size = input_sample_rate_ / 1000 * duration;
-    int16_t input_buffer[input_frame_size];
-
-
+    int duration = 30;
+    int input_frame_size = input_sample_rate_ / 1000 * duration * input_channels_;
    while (true) {
-        int samples = Read(input_buffer, input_frame_size);
+        std::vector<int16_t> input_data(input_frame_size);
+        int samples = Read(input_data.data(), input_data.size());
        if (samples > 0) {
-            on_input_data_(input_buffer, samples);
+            if (on_input_data_) {
+                on_input_data_(std::move(input_data));
+            }
        }
    }
 }
+
+void AudioDevice::SetOutputVolume(int volume) {
+    output_volume_ = volume;
+    ESP_LOGI(TAG, "Set output volume to %d", output_volume_);
+}
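The new `AudioDevice::Write` applies a quadratic volume curve and widens each 16-bit sample into a 32-bit slot. The following is a minimal host-side sketch of that arithmetic, not the project's code: `VolumeFactor` and `ScaleSample` are hypothetical helper names introduced here for illustration, and the input clamp to 0-100 is an added assumption.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: maps a 0-100 volume setting to a fixed-point gain
// using the same quadratic curve as the diff (100 -> 65536).
// Clamping the input to 0-100 is an assumption, not part of the diff.
int32_t VolumeFactor(int volume) {
    volume = std::clamp(volume, 0, 100);
    double normalized = static_cast<double>(volume) / 100.0;
    // Truncates toward zero, like the implicit conversion in the diff.
    return static_cast<int32_t>(normalized * normalized * 65536.0);
}

// Hypothetical helper: scales one 16-bit sample into a 32-bit slot.
// The product is computed in 64 bits and saturated to the int32_t range.
int32_t ScaleSample(int16_t sample, int32_t factor) {
    int64_t scaled = static_cast<int64_t>(sample) * factor;
    return static_cast<int32_t>(std::clamp<int64_t>(scaled, INT32_MIN, INT32_MAX));
}
```

With a factor of at most 65536, the product of any `int16_t` sample stays within the `int32_t` range, so the saturation branch acts as a safety net for factors above that bound.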
--- a/main/AudioDevice.h
+++ b/main/AudioDevice.h
@ -2,7 +2,6 @@
 #define _AUDIO_DEVICE_H

 #include <freertos/FreeRTOS.h>
-#include <freertos/event_groups.h>
 #include <driver/i2s_std.h>

 #include <vector>
@ -12,33 +11,42 @@
 class AudioDevice {
 public:
    AudioDevice();
-    ~AudioDevice();
+    virtual ~AudioDevice();
+    virtual void Initialize();

-    void Start(int input_sample_rate, int output_sample_rate);
-    void OnInputData(std::function<void(const int16_t*, int)> callback);
+    void OnInputData(std::function<void(std::vector<int16_t>&& data)> callback);
    void OutputData(std::vector<int16_t>& data);
+    virtual void SetOutputVolume(int volume);

-    int input_sample_rate() const { return input_sample_rate_; }
-    int output_sample_rate() const { return output_sample_rate_; }
-    bool duplex() const { return duplex_; }
+    inline bool duplex() const { return duplex_; }
+    inline bool input_reference() const { return input_reference_; }
+    inline int input_sample_rate() const { return input_sample_rate_; }
+    inline int output_sample_rate() const { return output_sample_rate_; }
+    inline int input_channels() const { return input_channels_; }
+    inline int output_channels() const { return output_channels_; }
+    inline int output_volume() const { return output_volume_; }

 private:
+    TaskHandle_t audio_input_task_ = nullptr;
+    std::function<void(std::vector<int16_t>&& data)> on_input_data_; 
+
+    void InputTask();
+    void CreateSimplexChannels();
+
+protected:
    bool duplex_ = false;
+    bool input_reference_ = false;
    int input_sample_rate_ = 0;
    int output_sample_rate_ = 0;
+    int input_channels_ = 1;
+    int output_channels_ = 1;
+    int output_volume_ = 70;
    i2s_chan_handle_t tx_handle_ = nullptr;
    i2s_chan_handle_t rx_handle_ = nullptr;

-    TaskHandle_t audio_input_task_ = nullptr;
-    
-    EventGroupHandle_t event_group_;
-    std::function<void(const int16_t*, int)> on_input_data_;
-
-    void CreateDuplexChannels();
-    void CreateSimplexChannels();
-    void InputTask();
-    int Read(int16_t* dest, int samples);
-    void Write(const int16_t* data, int samples);
+    virtual void CreateDuplexChannels();
+    virtual int Read(int16_t* dest, int samples);
+    virtual int Write(const int16_t* data, int samples);
 };

 #endif // _AUDIO_DEVICE_H
--- a/main/AudioProcessor.cc
+++ b/main/AudioProcessor.cc
@ -8,6 +8,12 @@ static const char* TAG = "AudioProcessor";
 AudioProcessor::AudioProcessor()
    : afe_communication_data_(nullptr) {
    event_group_ = xEventGroupCreate();
+}
+
+void AudioProcessor::Initialize(int channels, bool reference) {
+    channels_ = channels;
+    reference_ = reference;
+    int ref_num = reference_ ? 1 : 0;

    afe_config_t afe_config = {
        .aec_init = false,
@ -21,18 +27,18 @@ AudioProcessor::AudioProcessor()
        .wakenet_model_name = NULL,
        .wakenet_model_name_2 = NULL,
        .wakenet_mode = DET_MODE_90,
-        .afe_mode = SR_MODE_LOW_COST,
-        .afe_perferred_core = 0,
-        .afe_perferred_priority = 5,
+        .afe_mode = SR_MODE_HIGH_PERF,
+        .afe_perferred_core = 1,
+        .afe_perferred_priority = 1,
        .afe_ringbuf_size = 50,
        .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,
        .afe_linear_gain = 1.0,
        .agc_mode = AFE_MN_PEAK_AGC_MODE_2,
        .pcm_config = {
-            .total_ch_num = 1,
-            .mic_num = 1,
-            .ref_num = 0,
-            .sample_rate = CONFIG_AUDIO_INPUT_SAMPLE_RATE,
+            .total_ch_num = channels_,
+            .mic_num = channels_ - ref_num,
+            .ref_num = ref_num,
+            .sample_rate = 16000,
        },
        .debug_init = false,
        .debug_hook = {{ AFE_DEBUG_HOOK_MASE_TASK_IN, NULL }, { AFE_DEBUG_HOOK_FETCH_TASK_IN, NULL }},
@ -47,7 +53,7 @@ AudioProcessor::AudioProcessor()
        auto this_ = (AudioProcessor*)arg;
        this_->AudioProcessorTask();
        vTaskDelete(NULL);
-    }, "audio_communication", 4096 * 2, this, 5, NULL);
+    }, "audio_communication", 4096 * 2, this, 1, NULL);
 }

 AudioProcessor::~AudioProcessor() {
@ -57,10 +63,10 @@ AudioProcessor::~AudioProcessor() {
    vEventGroupDelete(event_group_);
 }

-void AudioProcessor::Input(const int16_t* data, int size) {
-    input_buffer_.insert(input_buffer_.end(), data, data + size);
+void AudioProcessor::Input(std::vector<int16_t>& data) {
+    input_buffer_.insert(input_buffer_.end(), data.begin(), data.end());

-    auto chunk_size = esp_afe_vc_v1.get_feed_chunksize(afe_communication_data_);
+    auto chunk_size = esp_afe_vc_v1.get_feed_chunksize(afe_communication_data_) * channels_;
    while (input_buffer_.size() >= chunk_size) {
        auto chunk = input_buffer_.data();
        esp_afe_vc_v1.feed(afe_communication_data_, chunk);
@ -92,6 +98,9 @@ void AudioProcessor::AudioProcessorTask() {
        xEventGroupWaitBits(event_group_, PROCESSOR_RUNNING, pdFALSE, pdTRUE, portMAX_DELAY);

        auto res = esp_afe_vc_v1.fetch(afe_communication_data_);
+        if ((xEventGroupGetBits(event_group_) & PROCESSOR_RUNNING) == 0) {
+            continue;
+        }
        if (res == nullptr || res->ret_value == ESP_FAIL) {
            if (res != nullptr) {
                ESP_LOGI(TAG, "Error code: %d", res->ret_value);
--- a/main/AudioProcessor.h
+++ b/main/AudioProcessor.h
@ -15,7 +15,8 @@ public:
    AudioProcessor();
    ~AudioProcessor();

-    void Input(const int16_t* data, int size);
+    void Initialize(int channels, bool reference);
+    void Input(std::vector<int16_t>& data);
    void Start();
    void Stop();
    bool IsRunning();
@ -26,6 +27,8 @@ private:
    esp_afe_sr_data_t* afe_communication_data_ = nullptr;
    std::vector<int16_t> input_buffer_;
    std::function<void(std::vector<int16_t>&& data)> output_callback_;
+    int channels_;
+    bool reference_;

    void AudioProcessorTask();
 };
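The channel bookkeeping introduced in `AudioProcessor::Initialize` and `AudioDevice::InputTask` can be checked on a host machine. This is a rough sketch under stated assumptions: `ChannelLayout`, `MakeLayout`, and `InterleavedSamples` are hypothetical names that do not exist in the project, and the arithmetic simply mirrors the expressions in the diff.

```cpp
#include <cstddef>

// Hypothetical struct mirroring the pcm_config fields set in the diff.
struct ChannelLayout {
    int total_ch_num;
    int mic_num;
    int ref_num;
};

// With a reference (loopback) channel enabled, one of the input channels
// carries the playback signal and the rest are microphones.
ChannelLayout MakeLayout(int channels, bool reference) {
    int ref_num = reference ? 1 : 0;
    return ChannelLayout{channels, channels - ref_num, ref_num};
}

// Number of interleaved int16_t samples in one input frame, following
// input_sample_rate_ / 1000 * duration * input_channels_ from InputTask.
std::size_t InterleavedSamples(int sample_rate_hz, int duration_ms, int channels) {
    return static_cast<std::size_t>(sample_rate_hz / 1000 * duration_ms * channels);
}
```

For example, a 16 kHz input with one microphone plus one reference channel and a 30 ms frame gives 960 interleaved samples per read.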
--- a/main/BoxAudioDevice.cc
+++ b/main/BoxAudioDevice.cc
@ -0,0 +1,232 @@
+#include "BoxAudioDevice.h"
+#include <esp_log.h>
+#include <cassert>
+
+static const char* TAG = "BoxAudioDevice";
+
+BoxAudioDevice::BoxAudioDevice() {
+}
+
+BoxAudioDevice::~BoxAudioDevice() {
+    ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+    esp_codec_dev_delete(output_dev_);
+    ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    esp_codec_dev_delete(input_dev_);
+
+    audio_codec_delete_codec_if(in_codec_if_);
+    audio_codec_delete_ctrl_if(in_ctrl_if_);
+    audio_codec_delete_codec_if(out_codec_if_);
+    audio_codec_delete_ctrl_if(out_ctrl_if_);
+    audio_codec_delete_gpio_if(gpio_if_);
+    audio_codec_delete_data_if(data_if_);
+
+    ESP_ERROR_CHECK(i2c_del_master_bus(i2c_master_handle_));
+}
+
+void BoxAudioDevice::Initialize() {
+    duplex_ = true; // full-duplex I2S
+    input_reference_ = CONFIG_AUDIO_CODEC_INPUT_REFERENCE; // use a reference input channel for echo cancellation
+    input_channels_ = input_reference_ ? 2 : 1; // number of input channels
+
+    // Initialize I2C peripheral
+    i2c_master_bus_config_t i2c_bus_cfg = {
+        .i2c_port = I2C_NUM_0,
+        .sda_io_num = (gpio_num_t)CONFIG_AUDIO_CODEC_I2C_SDA_PIN,
+        .scl_io_num = (gpio_num_t)CONFIG_AUDIO_CODEC_I2C_SCL_PIN,
+        .clk_source = I2C_CLK_SRC_DEFAULT,
+        .glitch_ignore_cnt = 7,
+        .intr_priority = 0,
+        .trans_queue_depth = 0,
+        .flags = {
+            .enable_internal_pullup = 1,
+        },
+    };
+    ESP_ERROR_CHECK(i2c_new_master_bus(&i2c_bus_cfg, &i2c_master_handle_));
+
+    CreateDuplexChannels();
+
+    // Initialize the related interfaces: data_if, ctrl_if and gpio_if
+    audio_codec_i2s_cfg_t i2s_cfg = {
+        .port = I2S_NUM_0,
+        .rx_handle = rx_handle_,
+        .tx_handle = tx_handle_,
+    };
+    data_if_ = audio_codec_new_i2s_data(&i2s_cfg);
+    assert(data_if_ != NULL);
+
+    // Output
+    audio_codec_i2c_cfg_t i2c_cfg = {
+        .port = I2C_NUM_0,
+        .addr = ES8311_CODEC_DEFAULT_ADDR,
+        .bus_handle = i2c_master_handle_,
+    };
+    out_ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(out_ctrl_if_ != NULL);
+
+    gpio_if_ = audio_codec_new_gpio();
+    assert(gpio_if_ != NULL);
+
+    es8311_codec_cfg_t es8311_cfg = {};
+    es8311_cfg.ctrl_if = out_ctrl_if_;
+    es8311_cfg.gpio_if = gpio_if_;
+    es8311_cfg.codec_mode = ESP_CODEC_DEV_WORK_MODE_DAC;
+    es8311_cfg.pa_pin = CONFIG_AUDIO_CODEC_PA_PIN;
+    es8311_cfg.use_mclk = true;
+    es8311_cfg.hw_gain.pa_voltage = 5.0;
+    es8311_cfg.hw_gain.codec_dac_voltage = 3.3;
+    out_codec_if_ = es8311_codec_new(&es8311_cfg);
+    assert(out_codec_if_ != NULL);
+
+    esp_codec_dev_cfg_t dev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_OUT,
+        .codec_if = out_codec_if_,
+        .data_if = data_if_,
+    };
+    output_dev_ = esp_codec_dev_new(&dev_cfg);
+    assert(output_dev_ != NULL);
+
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, output_volume_));
+
+    // Play 16bit 1 channel
+    esp_codec_dev_sample_info_t fs = {
+        .bits_per_sample = 16,
+        .channel = 1,
+        .channel_mask = 0,
+        .sample_rate = (uint32_t)output_sample_rate_,
+        .mclk_multiple = 0,
+    };
+    ESP_ERROR_CHECK(esp_codec_dev_open(output_dev_, &fs));
+
+    // Input
+    i2c_cfg.addr = ES7210_CODEC_DEFAULT_ADDR;
+    in_ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(in_ctrl_if_ != NULL);
+
+    es7210_codec_cfg_t es7210_cfg = {};
+    es7210_cfg.ctrl_if = in_ctrl_if_;
+    es7210_cfg.mic_selected = ES7120_SEL_MIC1 | ES7120_SEL_MIC2 | ES7120_SEL_MIC3 | ES7120_SEL_MIC4;
+    in_codec_if_ = es7210_codec_new(&es7210_cfg);
+    assert(in_codec_if_ != NULL);
+
+    dev_cfg.dev_type = ESP_CODEC_DEV_TYPE_IN;
+    dev_cfg.codec_if = in_codec_if_;
+    input_dev_ = esp_codec_dev_new(&dev_cfg);
+    assert(input_dev_ != NULL);
+
+    fs.channel = 4;
+    if (input_channels_ == 1) {
+        fs.channel_mask = ESP_CODEC_DEV_MAKE_CHANNEL_MASK(0);
+    } else {
+        fs.channel_mask = ESP_CODEC_DEV_MAKE_CHANNEL_MASK(0) | ESP_CODEC_DEV_MAKE_CHANNEL_MASK(1);
+    }
+    ESP_ERROR_CHECK(esp_codec_dev_open(input_dev_, &fs));
+
+    ESP_ERROR_CHECK(esp_codec_dev_set_in_channel_gain(input_dev_, ESP_CODEC_DEV_MAKE_CHANNEL_MASK(0), 30.0));
+
+    ESP_LOGI(TAG, "BoxAudioDevice initialized");
+}
+
+void BoxAudioDevice::CreateDuplexChannels() {
+    assert(input_sample_rate_ == output_sample_rate_);
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = 6,
+        .dma_frame_num = 240,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .ext_clk_freq_hz = 0,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = I2S_STD_SLOT_BOTH,
+            .ws_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+        },
+        .gpio_cfg = {
+            .mclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_MCLK,
+            .bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_BCLK,
+            .ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_LRCK,
+            .dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DOUT,
+            .din = I2S_GPIO_UNUSED,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    i2s_tdm_config_t tdm_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)input_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .ext_clk_freq_hz = 0,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+            .bclk_div = 8,
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = i2s_tdm_slot_mask_t(I2S_TDM_SLOT0 | I2S_TDM_SLOT1 | I2S_TDM_SLOT2 | I2S_TDM_SLOT3),
+            .ws_width = I2S_TDM_AUTO_WS_WIDTH,
+            .ws_pol = false,
+            .bit_shift = true,
+            .left_align = false,
+            .big_endian = false,
+            .bit_order_lsb = false,
+            .skip_mask = false,
+            .total_slot = I2S_TDM_AUTO_SLOT_NUM
+        },
+        .gpio_cfg = {
+            .mclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_MCLK,
+            .bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_BCLK,
+            .ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_LRCK,
+            .dout = I2S_GPIO_UNUSED,
+            .din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DIN,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_tdm_mode(rx_handle_, &tdm_cfg));
+    ESP_ERROR_CHECK(i2s_channel_enable(tx_handle_));
+    ESP_ERROR_CHECK(i2s_channel_enable(rx_handle_));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+int BoxAudioDevice::Read(int16_t *buffer, int samples) {
+    ESP_ERROR_CHECK(esp_codec_dev_read(input_dev_, (void*)buffer, samples * sizeof(int16_t)));
+    return samples;
+}
+
+int BoxAudioDevice::Write(const int16_t *buffer, int samples) {
+    ESP_ERROR_CHECK(esp_codec_dev_write(output_dev_, (void*)buffer, samples * sizeof(int16_t)));
+    return samples;
+}
+
+void BoxAudioDevice::SetOutputVolume(int volume) {
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, volume));
+    AudioDevice::SetOutputVolume(volume);
+}
--- a/main/BoxAudioDevice.h
+++ b/main/BoxAudioDevice.h
@ -0,0 +1,36 @@
+#ifndef _BOX_AUDIO_DEVICE_H
+#define _BOX_AUDIO_DEVICE_H
+
+#include "AudioDevice.h"
+#include <driver/i2c_master.h>
+#include <driver/i2s_tdm.h>
+#include <esp_codec_dev.h>
+#include <esp_codec_dev_defaults.h>
+
+
+class BoxAudioDevice : public AudioDevice {
+public:
+    BoxAudioDevice();
+    virtual ~BoxAudioDevice();
+    void Initialize() override;
+    void SetOutputVolume(int volume) override;
+
+private:
+    i2c_master_bus_handle_t i2c_master_handle_ = nullptr;
+
+    const audio_codec_data_if_t* data_if_ = nullptr;
+    const audio_codec_ctrl_if_t* out_ctrl_if_ = nullptr;
+    const audio_codec_if_t* out_codec_if_ = nullptr;
+    const audio_codec_ctrl_if_t* in_ctrl_if_ = nullptr;
+    const audio_codec_if_t* in_codec_if_ = nullptr;
+    const audio_codec_gpio_if_t* gpio_if_ = nullptr;
+
+    esp_codec_dev_handle_t output_dev_ = nullptr;
+    esp_codec_dev_handle_t input_dev_ = nullptr;
+
+    void CreateDuplexChannels() override;
+    int Read(int16_t* dest, int samples) override;
+    int Write(const int16_t* data, int samples) override;
+};
+
+#endif // _BOX_AUDIO_DEVICE_H
--- a/main/Button.cc
+++ b/main/Button.cc
@ -6,8 +6,8 @@ static const char* TAG = "Button";
 Button::Button(gpio_num_t gpio_num) : gpio_num_(gpio_num) {
    button_config_t button_config = {
        .type = BUTTON_TYPE_GPIO,
-        .long_press_time = 3000,
-        .short_press_time = 100,
+        .long_press_time = 1000,
+        .short_press_time = 50,
        .gpio_button_config = {
            .gpio_num = gpio_num,
            .active_level = 0
--- a/main/CMakeLists.txt
+++ b/main/CMakeLists.txt
@ -16,6 +16,9 @@ set(SOURCES "AudioDevice.cc"
 if(CONFIG_USE_AFE_SR)
    list(APPEND SOURCES "AudioProcessor.cc" "WakeWordDetect.cc")
 endif()
+if(CONFIG_AUDIO_CODEC_ES8311_ES7210)
+    list(APPEND SOURCES "BoxAudioDevice.cc")
+endif()

 idf_component_register(SRCS ${SOURCES}
                    INCLUDE_DIRS "."
--- a/main/Display.cc
+++ b/main/Display.cc
@ -17,7 +17,7 @@ Display::Display(int sda_pin, int scl_pin) : sda_pin_(sda_pin), scl_pin_(scl_pin
    ESP_LOGI(TAG, "Display Pins: %d, %d", sda_pin_, scl_pin_);

    i2c_master_bus_config_t bus_config = {
-        .i2c_port = I2C_NUM_0,
+        .i2c_port = I2C_NUM_1,
        .sda_io_num = (gpio_num_t)sda_pin_,
        .scl_io_num = (gpio_num_t)scl_pin_,
        .clk_source = I2C_CLK_SRC_DEFAULT,
@ -104,20 +104,29 @@ Display::Display(int sda_pin, int scl_pin) : sda_pin_(sda_pin), scl_pin_(scl_pin
        lv_label_set_text(label_, "Initializing...");
        lv_obj_set_width(label_, disp_->driver->hor_res);
        lv_obj_set_height(label_, disp_->driver->ver_res);
-        lv_obj_set_style_text_line_space(label_, 0, 0);
-        lv_obj_set_style_pad_all(label_, 0, 0);
-        lv_obj_set_style_outline_pad(label_, 0, 0);
+
+        notification_ = lv_label_create(lv_disp_get_scr_act(disp_));
+        lv_label_set_text(notification_, "Notification\nTest");
+        lv_obj_set_width(notification_, disp_->driver->hor_res);
+        lv_obj_set_height(notification_, disp_->driver->ver_res);
+        lv_obj_set_style_opa(notification_, LV_OPA_MIN, 0);
        lvgl_port_unlock();
    }
 }

 Display::~Display() {
-    if (label_ != nullptr) {
-        lvgl_port_lock(0);
-        lv_obj_del(label_);
-        lvgl_port_unlock();
+    if (notification_timer_ != nullptr) {
+        esp_timer_stop(notification_timer_);
+        esp_timer_delete(notification_timer_);
    }

+    lvgl_port_lock(0);
+    if (label_ != nullptr) {
+        lv_obj_del(label_);
+        lv_obj_del(notification_);
+    }
+    lvgl_port_unlock();
+
    if (disp_ != nullptr) {
        lvgl_port_deinit();
        esp_lcd_panel_del(panel_);
@ -136,4 +145,35 @@ void Display::SetText(const std::string &text) {
    }
 }

+void Display::ShowNotification(const std::string &text) {
+    if (notification_ != nullptr) {
+        lvgl_port_lock(0);
+        lv_label_set_text(notification_, text.c_str());
+        lv_obj_set_style_opa(notification_, LV_OPA_MAX, 0);
+        lv_obj_set_style_opa(label_, LV_OPA_MIN, 0);
+        lvgl_port_unlock();
+
+        if (notification_timer_ != nullptr) {
+            esp_timer_stop(notification_timer_);
+            esp_timer_delete(notification_timer_);
+        }
+
+        esp_timer_create_args_t timer_args = {
+            .callback = [](void *arg) {
+                Display *display = static_cast<Display*>(arg);
+                lvgl_port_lock(0);
+                lv_obj_set_style_opa(display->notification_, LV_OPA_MIN, 0);
+                lv_obj_set_style_opa(display->label_, LV_OPA_MAX, 0);
+                lvgl_port_unlock();
+            },
+            .arg = this,
+            .dispatch_method = ESP_TIMER_TASK,
+            .name = "Notification Timer",
+            .skip_unhandled_events = false,
+        };
+        ESP_ERROR_CHECK(esp_timer_create(&timer_args, &notification_timer_));
+        ESP_ERROR_CHECK(esp_timer_start_once(notification_timer_, 3000000));
+    }
+}
+
 #endif
--- a/main/Display.h
+++ b/main/Display.h
@ -5,6 +5,7 @@
 #include <esp_lcd_panel_io.h>
 #include <esp_lcd_panel_ops.h>
 #include <lvgl.h>
+#include <esp_timer.h>

 #include <string>

@ -14,6 +15,7 @@ public:
    ~Display();

    void SetText(const std::string &text);
+    void ShowNotification(const std::string &text);

 private:
    int sda_pin_;
@ -25,6 +27,8 @@ private:
    esp_lcd_panel_handle_t panel_ = nullptr;
    lv_disp_t *disp_ = nullptr;
    lv_obj_t *label_ = nullptr;
+    lv_obj_t *notification_ = nullptr;
+    esp_timer_handle_t notification_timer_ = nullptr;

    std::string text_;
 };
--- a/main/FirmwareUpgrade.cc
+++ b/main/FirmwareUpgrade.cc
@ -6,6 +6,7 @@
 #include <esp_http_client.h>
 #include <esp_ota_ops.h>
 #include <esp_app_format.h>
+#include <esp_chip_info.h>

 #include <vector>
 #include <sstream>
@ -24,10 +25,6 @@ void FirmwareUpgrade::SetCheckVersionUrl(std::string check_version_url) {
    check_version_url_ = check_version_url;
 }

-void FirmwareUpgrade::SetPostData(const std::string& post_data) {
-    post_data_ = post_data;
-}
-
 void FirmwareUpgrade::SetHeader(const std::string& key, const std::string& value) {
    headers_[key] = value;
 }
@ -45,13 +42,9 @@ void FirmwareUpgrade::CheckVersion() {
        http_.SetHeader(header.first, header.second);
    }

-    if (post_data_.empty()) {
-        http_.Open("GET", check_version_url_);
-    } else {
-        http_.SetHeader("Content-Type", "application/json");
-        http_.SetContent(post_data_);
-        http_.Open("POST", check_version_url_);
-    }
+    http_.SetHeader("Content-Type", "application/json");
+    http_.SetContent(GetPostData());
+    http_.Open("POST", check_version_url_);

    auto response = http_.GetBody();
    http_.Close();
@ -257,3 +250,99 @@ bool FirmwareUpgrade::IsNewVersionAvailable(const std::string& currentVersion, c
    
    return newer.size() > current.size();
 }
+
+void FirmwareUpgrade::SetBoardJson(const std::string& board_json) {
+    board_json_ = board_json;
+}
+
+std::string FirmwareUpgrade::GetPostData() {
+    /* 
+        {
+            "flash_size": 4194304,
+            "psram_size": 0,
+            "minimum_free_heap_size": 123456,
+            "mac_address": "00:00:00:00:00:00",
+            "chip_model_name": "esp32s3",
+            "chip_info": {
+                "model": 1,
+                "cores": 2,
+                "revision": 0,
+                "features": 0
+            },
+            "application": {
+                "name": "my-app",
+                "version": "1.0.0",
+                "compile_time": "2021-01-01T00:00:00Z",
+                "idf_version": "4.2-dev",
+                "elf_sha256": ""
+            },
+            "partition_table": [
+                {
+                    "label": "app",
+                    "type": 1,
+                    "subtype": 2,
+                    "address": 65536,
+                    "size": 1048576
+                }
+            ],
+            "ota": {
+                "label": "ota_0"
+            }
+        }
+    */
+    std::string json = "{";
+    json += "\"flash_size\":" + std::to_string(SystemInfo::GetFlashSize()) + ",";
+    json += "\"minimum_free_heap_size\":" + std::to_string(SystemInfo::GetMinimumFreeHeapSize()) + ",";
+    json += "\"mac_address\":\"" + SystemInfo::GetMacAddress() + "\",";
+    json += "\"chip_model_name\":\"" + SystemInfo::GetChipModelName() + "\",";
+    json += "\"chip_info\":{";
+
+    esp_chip_info_t chip_info;
+    esp_chip_info(&chip_info);
+    json += "\"model\":" + std::to_string(chip_info.model) + ",";
+    json += "\"cores\":" + std::to_string(chip_info.cores) + ",";
+    json += "\"revision\":" + std::to_string(chip_info.revision) + ",";
+    json += "\"features\":" + std::to_string(chip_info.features);
+    json += "},";
+
+    json += "\"application\":{";
+    auto app_desc = esp_app_get_description();
+    json += "\"name\":\"" + std::string(app_desc->project_name) + "\",";
+    json += "\"version\":\"" + std::string(app_desc->version) + "\",";
+    json += "\"compile_time\":\"" + std::string(app_desc->date) + "T" + std::string(app_desc->time) + "Z\",";
+    json += "\"idf_version\":\"" + std::string(app_desc->idf_ver) + "\",";
+
+    char sha256_str[65];
+    for (int i = 0; i < 32; i++) {
+        snprintf(sha256_str + i * 2, sizeof(sha256_str) - i * 2, "%02x", app_desc->app_elf_sha256[i]);
+    }
+    json += "\"elf_sha256\":\"" + std::string(sha256_str) + "\"";
+    json += "},";
+
+    json += "\"partition_table\": [";
+    esp_partition_iterator_t it = esp_partition_find(ESP_PARTITION_TYPE_ANY, ESP_PARTITION_SUBTYPE_ANY, NULL);
+    while (it) {
+        const esp_partition_t *partition = esp_partition_get(it);
+        json += "{";
+        json += "\"label\":\"" + std::string(partition->label) + "\",";
+        json += "\"type\":" + std::to_string(partition->type) + ",";
+        json += "\"subtype\":" + std::to_string(partition->subtype) + ",";
+        json += "\"address\":" + std::to_string(partition->address) + ",";
+        json += "\"size\":" + std::to_string(partition->size);
+        json += "},";
+        it = esp_partition_next(it);
+    }
+    json.pop_back(); // Remove the last comma
+    json += "],";
+
+    json += "\"ota\":{";
+    auto ota_partition = esp_ota_get_running_partition();
+    json += "\"label\":\"" + std::string(ota_partition->label) + "\"";
+    json += "},";
+
+    json += "\"board\":" + board_json_;
+
+    // Close the JSON object
+    json += "}";
+    return json;
+}
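`GetPostData` builds the partition array by appending a comma after every entry and then calling `pop_back()`, which would remove the opening `[` if the partition iterator ever returned nothing. One alternative, sketched here under assumptions, is to emit the separator before every entry except the first. `PartitionEntry` and `PartitionTableJson` are hypothetical names for illustration, and labels are assumed to contain no characters that need JSON escaping.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain-data stand-in for the esp_partition_t fields used in the diff.
struct PartitionEntry {
    std::string label;
    int type;
    int subtype;
    uint32_t address;
    uint32_t size;
};

// Serializes the entries as a JSON array. The separator is written before
// each entry except the first, so an empty list yields "[]".
std::string PartitionTableJson(const std::vector<PartitionEntry>& entries) {
    std::string json = "[";
    bool first = true;
    for (const auto& entry : entries) {
        if (!first) {
            json += ",";
        }
        first = false;
        json += "{";
        json += "\"label\":\"" + entry.label + "\",";
        json += "\"type\":" + std::to_string(entry.type) + ",";
        json += "\"subtype\":" + std::to_string(entry.subtype) + ",";
        json += "\"address\":" + std::to_string(entry.address) + ",";
        json += "\"size\":" + std::to_string(entry.size);
        json += "}";
    }
    json += "]";
    return json;
}
```

On real hardware the partition table is unlikely to be empty, so this is a robustness choice rather than a fix for an observed failure.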
--- a/main/FirmwareUpgrade.h
+++ b/main/FirmwareUpgrade.h
@ -12,8 +12,8 @@ public:
    FirmwareUpgrade(Http& http);
    ~FirmwareUpgrade();

+    void SetBoardJson(const std::string& board_json);
    void SetCheckVersionUrl(std::string check_version_url);
-    void SetPostData(const std::string& post_data);
    void SetHeader(const std::string& key, const std::string& value);
    void CheckVersion();
    bool HasNewVersion() { return has_new_version_; }
@ -26,13 +26,14 @@ private:
    bool has_new_version_ = false;
    std::string firmware_version_;
    std::string firmware_url_;
-    std::string post_data_;
+    std::string board_json_;
    std::map<std::string, std::string> headers_;

    void Upgrade(const std::string& firmware_url);
    std::function<void(int progress, size_t speed)> upgrade_callback_;
    std::vector<int> ParseVersion(const std::string& version);
    bool IsNewVersionAvailable(const std::string& currentVersion, const std::string& newVersion);
+    std::string GetPostData();
 };

 #endif // _FIRMWARE_UPGRADE_H
--- a/main/Kconfig.projbuild
+++ b/main/Kconfig.projbuild
@ -30,49 +30,136 @@ config AUDIO_OUTPUT_SAMPLE_RATE
    help
        Audio output sample rate.

-config AUDIO_DEVICE_I2S_MIC_GPIO_WS
-    int "I2S GPIO WS"
-    default 4
+choice AUDIO_CODEC
+    prompt "Audio Codec"
+    default AUDIO_CODEC_NONE
    help
-        GPIO number of the I2S WS.
+        Audio codec.
+    config AUDIO_CODEC_ES8311_ES7210
+        bool "Box: ES8311 + ES7210"
+    config AUDIO_CODEC_NONE
+        bool "None"
+endchoice

-config AUDIO_DEVICE_I2S_MIC_GPIO_BCLK
-    int "I2S GPIO BCLK"
-    default 5
-    help
-        GPIO number of the I2S BCLK.
+menu "Box Audio Codec I2C and PA Control"
+    depends on AUDIO_CODEC_ES8311_ES7210
+    
+    config AUDIO_CODEC_I2C_SDA_PIN
+        int "Audio Codec I2C SDA Pin"
+        default 39
+        help
+            Audio codec I2C SDA pin.

-config AUDIO_DEVICE_I2S_MIC_GPIO_DIN
-    int "I2S GPIO DIN"
-    default 6
-    help
-        GPIO number of the I2S DIN.
+    config AUDIO_CODEC_I2C_SCL_PIN
+        int "Audio Codec I2C SCL Pin"
+        default 38
+        help
+            Audio codec I2C SCL pin.
+    
+    config AUDIO_CODEC_PA_PIN
+        int "Audio Codec PA Pin"
+        default 40
+        help
+            Audio codec PA pin.
+    
+    config AUDIO_CODEC_INPUT_REFERENCE
+        bool "Audio Codec Input Reference"
+        default y
+        help
+            Audio codec input reference.
+endmenu

-config AUDIO_DEVICE_I2S_SPK_GPIO_DOUT
-    int "I2S GPIO DOUT"
-    default 7
+choice AUDIO_I2S_METHOD
+    prompt "Audio I2S Method"
+    default AUDIO_I2S_METHOD_SIMPLEX if AUDIO_CODEC_NONE
+    default AUDIO_I2S_METHOD_DUPLEX if AUDIO_CODEC_ES8311_ES7210
    help
-        GPIO number of the I2S DOUT.
-    
-config AUDIO_DEVICE_I2S_SIMPLEX
-    bool "I2S Simplex"
-    default y
-    help
-        Enable I2S Simplex mode.
-    
-config AUDIO_DEVICE_I2S_SPK_GPIO_BCLK
-    int "I2S SPK GPIO BCLK"
-    default 15
-    depends on AUDIO_DEVICE_I2S_SIMPLEX
-    help
-        GPIO number of the I2S MIC BCLK.
-    
-config AUDIO_DEVICE_I2S_SPK_GPIO_WS
-    int "I2S SPK GPIO WS"
-    default 16
-    depends on AUDIO_DEVICE_I2S_SIMPLEX
-    help
-        GPIO number of the I2S MIC WS.
+        Audio I2S method.
+    config AUDIO_I2S_METHOD_SIMPLEX
+        bool "Simplex"
+        help
+            Use I2S 0 as the audio input and I2S 1 as the audio output.
+    config AUDIO_I2S_METHOD_DUPLEX
+        bool "Duplex"
+        help
+            Use I2S 0 as the audio input and audio output.
+endchoice
+
+menu "Audio I2S Simplex"
+    depends on AUDIO_I2S_METHOD_SIMPLEX
+
+    config AUDIO_DEVICE_I2S_MIC_GPIO_WS
+        int "I2S MIC GPIO WS"
+        default 4
+        help
+            GPIO number of the I2S MIC WS.
+
+    config AUDIO_DEVICE_I2S_MIC_GPIO_SCK
+        int "I2S MIC GPIO SCK"
+        default 5
+        help
+            GPIO number of the I2S MIC SCK.
+
+    config AUDIO_DEVICE_I2S_MIC_GPIO_DIN
+        int "I2S MIC GPIO DIN"
+        default 6
+        help
+            GPIO number of the I2S MIC DIN.
+
+    config AUDIO_DEVICE_I2S_SPK_GPIO_DOUT
+        int "I2S SPK GPIO DOUT"
+        default 7
+        help
+            GPIO number of the I2S SPK DOUT.
+        
+    config AUDIO_DEVICE_I2S_SPK_GPIO_BCLK
+        int "I2S SPK GPIO BCLK"
+        default 15
+        help
+            GPIO number of the I2S SPK BCLK.
+        
+    config AUDIO_DEVICE_I2S_SPK_GPIO_LRCK
+        int "I2S SPK GPIO LRCK"
+        default 16
+        help
+            GPIO number of the I2S SPK LRCK.
+
+endmenu
+
+menu "Audio I2S Duplex"
+    depends on AUDIO_I2S_METHOD_DUPLEX
+
+    config AUDIO_DEVICE_I2S_GPIO_MCLK
+        int "I2S GPIO MCLK"
+        default -1
+        help
+            GPIO number of the I2S MCLK.
+
+    config AUDIO_DEVICE_I2S_GPIO_LRCK
+        int "I2S GPIO LRCK"
+        default 4
+        help
+            GPIO number of the I2S LRCK.
+
+    config AUDIO_DEVICE_I2S_GPIO_BCLK
+        int "I2S GPIO BCLK / SCLK"
+        default 5
+        help
+            GPIO number of the I2S BCLK.
+
+    config AUDIO_DEVICE_I2S_GPIO_DIN
+        int "I2S GPIO DIN"
+        default 6
+        help
+            GPIO number of the I2S DIN.
+
+    config AUDIO_DEVICE_I2S_GPIO_DOUT
+        int "I2S GPIO DOUT"
+        default 7
+        help
+            GPIO number of the I2S DOUT.
+
+endmenu

 config BOOT_BUTTON_GPIO
    int "Boot Button GPIO"
@ -80,6 +167,18 @@ config BOOT_BUTTON_GPIO
    help
        GPIO number of the boot button.

+config VOLUME_UP_BUTTON_GPIO
+    int "Volume Up Button GPIO"
+    default 40
+    help
+        GPIO number of the volume up button.
+
+config VOLUME_DOWN_BUTTON_GPIO
+    int "Volume Down Button GPIO"
+    default 39
+    help
+        GPIO number of the volume down button.
+
 config USE_AFE_SR
    bool "Use Espressif AFE SR"
    default y
--- a/main/SystemInfo.cc
+++ b/main/SystemInfo.cc
@ -3,7 +3,6 @@
 #include <esp_log.h>
 #include <esp_flash.h>
 #include <esp_mac.h>
-#include <esp_chip_info.h>
 #include <esp_system.h>
 #include <esp_partition.h>
 #include <esp_app_desc.h>
@ -41,96 +40,6 @@ std::string SystemInfo::GetChipModelName() {
    return std::string(CONFIG_IDF_TARGET);
 }

-std::string SystemInfo::GetJsonString() {
-    /* 
-        {
-            "flash_size": 4194304,
-            "psram_size": 0,
-            "minimum_free_heap_size": 123456,
-            "mac_address": "00:00:00:00:00:00",
-            "chip_model_name": "esp32s3",
-            "chip_info": {
-                "model": 1,
-                "cores": 2,
-                "revision": 0,
-                "features": 0
-            },
-            "application": {
-                "name": "my-app",
-                "version": "1.0.0",
-                "compile_time": "2021-01-01T00:00:00Z"
-                "idf_version": "4.2-dev"
-                "elf_sha256": ""
-            },
-            "partition_table": [
-                "app": {
-                    "label": "app",
-                    "type": 1,
-                    "subtype": 2,
-                    "address": 0x10000,
-                    "size": 0x100000
-                }
-            ],
-            "ota": {
-                "label": "ota_0"
-            }
-        }
-    */
-    std::string json = "{";
-    json += "\"flash_size\":" + std::to_string(GetFlashSize()) + ",";
-    json += "\"minimum_free_heap_size\":" + std::to_string(GetMinimumFreeHeapSize()) + ",";
-    json += "\"mac_address\":\"" + GetMacAddress() + "\",";
-    json += "\"chip_model_name\":\"" + GetChipModelName() + "\",";
-    json += "\"chip_info\":{";
-
-    esp_chip_info_t chip_info;
-    esp_chip_info(&chip_info);
-    json += "\"model\":" + std::to_string(chip_info.model) + ",";
-    json += "\"cores\":" + std::to_string(chip_info.cores) + ",";
-    json += "\"revision\":" + std::to_string(chip_info.revision) + ",";
-    json += "\"features\":" + std::to_string(chip_info.features);
-    json += "},";
-
-    json += "\"application\":{";
-    auto app_desc = esp_app_get_description();
-    json += "\"name\":\"" + std::string(app_desc->project_name) + "\",";
-    json += "\"version\":\"" + std::string(app_desc->version) + "\",";
-    json += "\"compile_time\":\"" + std::string(app_desc->date) + "T" + std::string(app_desc->time) + "Z\",";
-    json += "\"idf_version\":\"" + std::string(app_desc->idf_ver) + "\",";
-
-    char sha256_str[65];
-    for (int i = 0; i < 32; i++) {
-        snprintf(sha256_str + i * 2, sizeof(sha256_str) - i * 2, "%02x", app_desc->app_elf_sha256[i]);
-    }
-    json += "\"elf_sha256\":\"" + std::string(sha256_str) + "\"";
-    json += "},";
-
-    json += "\"partition_table\": [";
-    esp_partition_iterator_t it = esp_partition_find(ESP_PARTITION_TYPE_ANY, ESP_PARTITION_SUBTYPE_ANY, NULL);
-    while (it) {
-        const esp_partition_t *partition = esp_partition_get(it);
-        json += "{";
-        json += "\"label\":\"" + std::string(partition->label) + "\",";
-        json += "\"type\":" + std::to_string(partition->type) + ",";
-        json += "\"subtype\":" + std::to_string(partition->subtype) + ",";
-        json += "\"address\":" + std::to_string(partition->address) + ",";
-        json += "\"size\":" + std::to_string(partition->size);
-        json += "},";
-        it = esp_partition_next(it);
-    }
-    json.pop_back(); // Remove the last comma
-    json += "],";
-
-    json += "\"ota\":{";
-    auto ota_partition = esp_ota_get_running_partition();
-    json += "\"label\":\"" + std::string(ota_partition->label) + "\"";
-    json += "}";
-
-    // Close the JSON object
-    json += "}";
-    return json;
-}
-
 esp_err_t SystemInfo::PrintRealTimeStats(TickType_t xTicksToWait) {
    #define ARRAY_SIZE_OFFSET 5
    TaskStatus_t *start_array = NULL, *end_array = NULL;
--- a/main/SystemInfo.h
+++ b/main/SystemInfo.h
@ -13,7 +13,6 @@ public:
    static size_t GetFreeHeapSize();
    static std::string GetMacAddress();
    static std::string GetChipModelName();
-    static std::string GetJsonString();
    static esp_err_t PrintRealTimeStats(TickType_t xTicksToWait);
 };

--- a/main/WakeWordDetect.cc
+++ b/main/WakeWordDetect.cc
@ -15,6 +15,24 @@ WakeWordDetect::WakeWordDetect()
      wake_word_opus_() {

    event_group_ = xEventGroupCreate();
+}
+
+WakeWordDetect::~WakeWordDetect() {
+    if (afe_detection_data_ != nullptr) {
+        esp_afe_sr_v1.destroy(afe_detection_data_);
+    }
+
+    if (wake_word_encode_task_stack_ != nullptr) {
+        free(wake_word_encode_task_stack_);
+    }
+
+    vEventGroupDelete(event_group_);
+}
+
+void WakeWordDetect::Initialize(int channels, bool reference) {
+    channels_ = channels;
+    reference_ = reference;
+    int ref_num = reference_ ? 1 : 0;

    srmodel_list_t *models = esp_srmodel_init("model");
    for (int i = 0; i < models->num; i++) {
@ -25,7 +43,7 @@ WakeWordDetect::WakeWordDetect()
    }

    afe_config_t afe_config = {
-        .aec_init = false,
+        .aec_init = reference_,
        .se_init = true,
        .vad_init = true,
        .wakenet_init = true,
@ -37,17 +55,17 @@ WakeWordDetect::WakeWordDetect()
        .wakenet_model_name_2 = NULL,
        .wakenet_mode = DET_MODE_90,
        .afe_mode = SR_MODE_HIGH_PERF,
-        .afe_perferred_core = 0,
-        .afe_perferred_priority = 5,
+        .afe_perferred_core = 1,
+        .afe_perferred_priority = 1,
        .afe_ringbuf_size = 50,
        .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,
        .afe_linear_gain = 1.0,
        .agc_mode = AFE_MN_PEAK_AGC_MODE_2,
        .pcm_config = {
-            .total_ch_num = 1,
-            .mic_num = 1,
-            .ref_num = 0,
-            .sample_rate = CONFIG_AUDIO_INPUT_SAMPLE_RATE
+            .total_ch_num = channels_,
+            .mic_num = channels_ - ref_num,
+            .ref_num = ref_num,
+            .sample_rate = 16000
        },
        .debug_init = false,
        .debug_hook = {{ AFE_DEBUG_HOOK_MASE_TASK_IN, NULL }, { AFE_DEBUG_HOOK_FETCH_TASK_IN, NULL }},
@ -62,19 +80,7 @@ WakeWordDetect::WakeWordDetect()
        auto this_ = (WakeWordDetect*)arg;
        this_->AudioDetectionTask();
        vTaskDelete(NULL);
-    }, "audio_detection", 4096 * 2, this, 5, NULL);
-}
-
-WakeWordDetect::~WakeWordDetect() {
-    if (afe_detection_data_ != nullptr) {
-        esp_afe_sr_v1.destroy(afe_detection_data_);
-    }
-
-    if (wake_word_encode_task_stack_ != nullptr) {
-        free(wake_word_encode_task_stack_);
-    }
-
-    vEventGroupDelete(event_group_);
+    }, "audio_detection", 4096 * 2, this, 1, NULL);
 }

 void WakeWordDetect::OnWakeWordDetected(std::function<void()> callback) {
@ -97,10 +103,10 @@ bool WakeWordDetect::IsDetectionRunning() {
    return xEventGroupGetBits(event_group_) & DETECTION_RUNNING_EVENT;
 }

-void WakeWordDetect::Feed(const int16_t* data, int size) {
-    input_buffer_.insert(input_buffer_.end(), data, data + size);
+void WakeWordDetect::Feed(std::vector<int16_t>& data) {
+    input_buffer_.insert(input_buffer_.end(), data.begin(), data.end());

-    auto chunk_size = esp_afe_sr_v1.get_feed_chunksize(afe_detection_data_);
+    auto chunk_size = esp_afe_sr_v1.get_feed_chunksize(afe_detection_data_) * channels_;
    while (input_buffer_.size() >= chunk_size) {
        esp_afe_sr_v1.feed(afe_detection_data_, input_buffer_.data());
        input_buffer_.erase(input_buffer_.begin(), input_buffer_.begin() + chunk_size);
@ -166,7 +172,7 @@ void WakeWordDetect::EncodeWakeWordData() {
        auto start_time = esp_timer_get_time();
        // encode detect packets
        OpusEncoder* encoder = new OpusEncoder();
-        encoder->Configure(CONFIG_AUDIO_INPUT_SAMPLE_RATE, 1, 60);
+        encoder->Configure(16000, 1, 60);
        encoder->SetComplexity(0);
        this_->wake_word_opus_.resize(4096 * 4);
        size_t offset = 0;
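
The `Feed` change above multiplies the AFE feed chunk size by `channels_` because the input buffer holds interleaved samples from every channel. A minimal Python sketch of that buffering logic follows. It assumes a hypothetical `feed_fn` callback standing in for `esp_afe_sr_v1.feed` and a fixed per-channel chunk size; it is an illustration, not the firmware code.

```python
from typing import Callable, List


class ChunkFeeder:
    """Buffers interleaved PCM samples and hands them off in fixed-size chunks.

    Hypothetical stand-in for WakeWordDetect::Feed; `feed_fn` plays the role
    of esp_afe_sr_v1.feed and is an assumption for illustration only.
    """

    def __init__(self, samples_per_channel: int, channels: int,
                 feed_fn: Callable[[List[int]], None]) -> None:
        # One chunk holds samples_per_channel frames, each with `channels` samples.
        self.chunk_size = samples_per_channel * channels
        self.feed_fn = feed_fn
        self.buffer: List[int] = []

    def feed(self, data: List[int]) -> None:
        self.buffer.extend(data)
        # Drain every complete chunk; leftovers wait for the next call.
        while len(self.buffer) >= self.chunk_size:
            self.feed_fn(self.buffer[:self.chunk_size])
            del self.buffer[:self.chunk_size]


def demo_chunk_feeder() -> List[int]:
    """Feeds 2-channel data in uneven pieces and returns the chunk lengths seen."""
    seen: List[int] = []
    feeder = ChunkFeeder(samples_per_channel=4, channels=2,
                         feed_fn=lambda chunk: seen.append(len(chunk)))
    feeder.feed(list(range(5)))    # 5 samples buffered, no full chunk yet
    feeder.feed(list(range(12)))   # 17 samples total: two chunks of 8, 1 left over
    return seen
```

Without the multiplication by the channel count, each chunk passed on would cover only a fraction of the frames the consumer expects when more than one channel is interleaved.
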
--- a/main/WakeWordDetect.h
+++ b/main/WakeWordDetect.h
@ -19,7 +19,8 @@ public:
    WakeWordDetect();
    ~WakeWordDetect();

-    void Feed(const int16_t* data, int size);
+    void Initialize(int channels, bool reference);
+    void Feed(std::vector<int16_t>& data);
    void OnWakeWordDetected(std::function<void()> callback);
    void OnVadStateChange(std::function<void(bool speaking)> callback);
    void StartDetection();
@ -36,6 +37,8 @@ private:
    std::function<void()> wake_word_detected_callback_;
    std::function<void(bool speaking)> vad_state_change_callback_;
    bool is_speaking_ = false;
+    int channels_;
+    bool reference_;

    TaskHandle_t wake_word_encode_task_ = nullptr;
    StaticTask_t wake_word_encode_task_buffer_;
--- a/main/idf_component.yml
+++ b/main/idf_component.yml
@ -3,9 +3,10 @@ dependencies:
  espressif/iot_usbh_modem: "^0.2.1"
  espressif/esp_modem: "^1.1.0"
  78/esp-builtin-led: "^1.0.2"
-  78/esp-wifi-connect: "^1.1.0"
+  78/esp-wifi-connect: "^1.2.0"
  78/esp-opus-encoder: "^1.0.2"
-  78/esp-ml307: "^1.1.1"
+  78/esp-ml307: "^1.2.1"
+  espressif/esp_codec_dev: "^1.3.1"
  espressif/esp-sr: "^1.9.0"
  espressif/button: "^3.3.1"
  lvgl/lvgl: "^8.4.0"
--- a/main/main.cc
+++ b/main/main.cc
@ -154,6 +154,19 @@ static void on_modem_event(void *arg, esp_event_base_t event_base,

 extern "C" void app_main(void)
 {
+#ifdef CONFIG_AUDIO_CODEC_ES8311_ES7210
+    // Make GPIO15 HIGH to enable the 4G module
+    gpio_config_t ml307_enable_config = {
+        .pin_bit_mask = (1ULL << 15),
+        .mode = GPIO_MODE_OUTPUT,
+        .pull_up_en = GPIO_PULLUP_DISABLE,
+        .pull_down_en = GPIO_PULLDOWN_DISABLE,
+        .intr_type = GPIO_INTR_DISABLE,
+    };
+    gpio_config(&ml307_enable_config);
+    gpio_set_level(GPIO_NUM_15, 1);
+#endif
+
    // Check if the reset button is pressed
    SystemReset system_reset;
    system_reset.CheckButtons();
--- a/pack.py
+++ b/pack.py
@ -1,68 +0,0 @@
-#! /usr/bin/env python3
-
-import csv
-import os
-
-# e.g. 1000, 0x1000, 1M
-def read_value(text):
-    text = text.strip()
-    if text.endswith('K'):
-        return int(text[:-1]) * 1024
-    elif text.endswith('M'):
-        return int(text[:-1]) * 1024 * 1024
-    else:
-        if text.startswith('0x'):
-            return int(text, 16)
-        else:
-            return int(text)
-
-
-def write_bin(image_data, offset, file_path, max_size=None):
-    # Read file_path and write to image_data
-    with open(file_path, 'rb') as f:
-        data = f.read()
-        if max_size is not None:
-            assert len(data) <= max_size, f"Data from {file_path} is too large"
-        image_data[offset:offset+len(data)] = data
-        print(f"Write {os.path.basename(file_path)} to 0x{offset:08X} with size 0x{len(data):08X}")
-
-
-'''
-Pack the bin files into a single 4MB image file according to partitions.csv, for easier flashing
-'''
-def pack_firmware_image():
-    # Create a 4MB image filled with 0xFF
-    image_size = 4 * 1024 * 1024
-    image_data = bytearray([0xFF] * image_size)
-
-    build_dir = os.path.join(os.path.dirname(__file__), 'build')
-    write_bin(image_data, 0, os.path.join(build_dir, 'bootloader', 'bootloader.bin'))
-    write_bin(image_data, 0x8000, os.path.join(build_dir, 'partition_table', 'partition-table.bin'))
-
-    # Read partitions.csv
-    with open('partitions.csv', 'r') as f:
-        reader = csv.reader(f)
-        for row in reader:
-            if row[0] == 'model':   
-                file_path = os.path.join(build_dir, 'srmodels', 'srmodels.bin')
-            elif row[0] == 'factory':
-                file_path = os.path.join(build_dir, 'xiaozhi.bin')
-            else:
-                continue
-
-            offset = read_value(row[3])
-            size = read_value(row[4])
-            write_bin(image_data, offset, file_path, size)
-
-    # Write the image file
-    output_path = os.path.join(build_dir, 'xiaozhi.img')
-    with open(output_path, 'wb') as f:
-        f.write(image_data)
-    print(f"Image file {output_path} created with size 0x{len(image_data):08X}")
-
-    # Compress image with zip without directory
-    os.system(f"zip -j {output_path}.zip {output_path}")
-
-
-if __name__ == '__main__':
-    pack_firmware_image()
--- a/publish.py
+++ b/publish.py
@ -1,51 +0,0 @@
-#! /usr/bin/env python3
-from dotenv import load_dotenv
-load_dotenv()
-
-import os
-import oss2
-import json
-
-def get_version():
-    with open('CMakeLists.txt', 'r') as f:
-        for line in f:
-            if line.startswith('set(PROJECT_VER'):
-                return line.split('"')[1]
-    return '0.0.0'
-
-def upload_bin_to_oss(bin_path, oss_key):
-    auth = oss2.Auth(os.environ['OSS_ACCESS_KEY_ID'], os.environ['OSS_ACCESS_KEY_SECRET'])
-    bucket = oss2.Bucket(auth, os.environ['OSS_ENDPOINT'], os.environ['OSS_BUCKET_NAME'])
-    bucket.put_object(oss_key, open(bin_path, 'rb'))
-
-
-if __name__ == '__main__':
-    # Get the version number
-    version = get_version()
-    print(f'version: {version}')
-
-    # Upload the bin file to OSS
-    upload_bin_to_oss('build/xiaozhi.bin', f'firmwares/xiaozhi-{version}.bin')
-
-    # File URL
-    file_url = os.path.join(os.environ['OSS_BUCKET_URL'], f'firmwares/xiaozhi-{version}.bin')
-    print(f'Uploaded bin to OSS: {file_url}')
-
-    firmware_json = {
-        "version": version,
-        "url": file_url
-    }
-    with open(f"build/firmware.json", "w") as f:
-        json.dump(firmware_json, f, indent=4)
-    
-    # copy firmware.json to server
-    firmware_config_path = os.environ['FIRMWARE_CONFIG_PATH']
-    ret = os.system(f'scp build/firmware.json {firmware_config_path}')
-    if ret != 0:
-        print(f'Failed to copy firmware.json to server')
-        exit(1)
-    print(f'Copied firmware.json to server: {firmware_config_path}')
-
-
-
-    
--- a/sdkconfig.box
+++ b/sdkconfig.box
--- a/versions.py
+++ b/versions.py
@ -0,0 +1,158 @@
+#! /usr/bin/env python3
+from dotenv import load_dotenv
+load_dotenv()
+
+import os
+import struct
+import zipfile
+import oss2
+import json
+
+def get_chip_id_string(chip_id):
+    return {
+        0x0000: "esp32",
+        0x0002: "esp32s2",
+        0x0005: "esp32c3",
+        0x0009: "esp32s3",
+        0x000C: "esp32c2",
+        0x000D: "esp32c6",
+        0x0010: "esp32h2",
+        0x0011: "esp32c5",
+        0x0012: "esp32p4",
+        0x0017: "esp32c5",
+    }[chip_id]
+
+def get_flash_size(flash_size):
+    MB = 1024 * 1024
+    return {
+        0x00: 1 * MB,
+        0x01: 2 * MB,
+        0x02: 4 * MB,
+        0x03: 8 * MB,
+        0x04: 16 * MB,
+        0x05: 32 * MB,
+        0x06: 64 * MB,
+        0x07: 128 * MB,
+    }[flash_size]
+
+def get_app_desc(data):
+    magic = struct.unpack("<I", data[0x00:0x04])[0]
+    if magic != 0xabcd5432:
+        raise Exception("Invalid app desc magic")
+    version = data[0x10:0x30].decode("utf-8").strip('\0')
+    project_name = data[0x30:0x50].decode("utf-8").strip('\0')
+    time = data[0x50:0x60].decode("utf-8").strip('\0')
+    date = data[0x60:0x70].decode("utf-8").strip('\0')
+    idf_ver = data[0x70:0x90].decode("utf-8").strip('\0')
+    elf_sha256 = data[0x90:0xb0].hex()
+    return {
+        "name": project_name,
+        "version": version,
+        "compile_time": date + "T" + time,
+        "idf_version": idf_ver,
+        "elf_sha256": elf_sha256,
+    }
+
+def get_board_name(folder):
+    basename = os.path.basename(folder)
+    if basename.startswith("v0.2"):
+        return "simple"
+    if basename.startswith("v0.3") or basename.startswith("v0.4"):
+        if "ML307" in basename:
+            return "compact.4g"
+        else:
+            return "compact.wifi"
+    raise Exception(f"Unknown board name: {basename}")
+
+def read_binary(dir_path):
+    merged_bin_path = os.path.join(dir_path, "merged-binary.bin")
+    data = open(merged_bin_path, "rb").read()[0x200000:]
+    # 0xE9 is the ESP image magic byte; fail early so main() never receives None
+    if data[0] != 0xE9:
+        raise Exception(f"{dir_path} is not a valid image")
+    # get flash size
+    flash_size = get_flash_size(data[0x3] >> 4)
+    chip_id = get_chip_id_string(data[0xC])
+    # get segments
+    segment_count = data[0x1]
+    segments = []
+    offset = 0x18
+    for i in range(segment_count):
+        segment_size = struct.unpack("<I", data[offset + 4:offset + 8])[0]
+        offset += 8
+        segment_data = data[offset:offset + segment_size]
+        offset += segment_size
+        segments.append(segment_data)
+    assert offset < len(data), "offset is out of bounds"
+    
+    # extract bin file
+    bin_path = os.path.join(dir_path, "xiaozhi.bin")
+    if not os.path.exists(bin_path):
+        print("extract bin file to", bin_path)
+        open(bin_path, "wb").write(data)
+
+    # The app desc is in the first segment
+    desc = get_app_desc(segments[0])
+    return {
+        "chip_id": chip_id,
+        "flash_size": flash_size,
+        "board": get_board_name(dir_path),
+        "application": desc,
+    }
+
+def extract_zip(zip_path, extract_path):
+    if not os.path.exists(extract_path):
+        os.makedirs(extract_path)
+    print(f"Extracting {zip_path} to {extract_path}")
+    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
+        zip_ref.extractall(extract_path)
+
+def upload_dir_to_oss(source_dir, target_dir):
+    auth = oss2.Auth(os.environ['OSS_ACCESS_KEY_ID'], os.environ['OSS_ACCESS_KEY_SECRET'])
+    bucket = oss2.Bucket(auth, os.environ['OSS_ENDPOINT'], os.environ['OSS_BUCKET_NAME'])
+    for filename in os.listdir(source_dir):
+        oss_key = os.path.join(target_dir, filename)
+        print('uploading', oss_key)
+        bucket.put_object(oss_key, open(os.path.join(source_dir, filename), 'rb'))
+
+def main():
+    release_dir = "releases"
+    versions = []
+    # look for zip files whose names start with "v"
+    for name in os.listdir(release_dir):
+        if name.startswith("v") and name.endswith(".zip"):
+            tag = name[:-4]
+            folder = os.path.join(release_dir, tag)
+            if not os.path.exists(folder):
+                os.makedirs(folder)
+                extract_zip(os.path.join(release_dir, name), folder)
+                info = read_binary(folder)
+                target_dir = os.path.join("firmwares", tag)
+                info["tag"] = tag
+                info["url"] = os.path.join(os.environ['OSS_BUCKET_URL'], target_dir, "xiaozhi.bin")
+                open(os.path.join(folder, "info.json"), "w").write(json.dumps(info, indent=4))
+                # upload all files to OSS
+                upload_dir_to_oss(folder, target_dir)
+            # read info.json
+            info = json.load(open(os.path.join(folder, "info.json")))
+            versions.append(info)
+
+    # sort versions by tag, newest first (plain string comparison)
+    versions.sort(key=lambda x: x["tag"], reverse=True)
+    # write versions to file
+    versions_path = os.path.join(release_dir, "versions.json")
+    open(versions_path, "w").write(json.dumps(versions, indent=4))
+    print(f"Versions written to {versions_path}")
+
+    # copy versions.json to server
+    versions_config_path = os.environ['VERSIONS_CONFIG_PATH']
+    ret = os.system(f'scp {versions_path} {versions_config_path}')
+    if ret != 0:
+        print(f'Failed to copy versions.json to server')
+        exit(1)
+    print(f'Copied versions.json to server: {versions_config_path}')
+
+
+
+if __name__ == "__main__":
+    main()
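
`versions.sort(key=lambda x: x["tag"], reverse=True)` compares tags as plain strings, so a tag such as `v0.10.0` would sort below `v0.9.0`. If numeric ordering is wanted, one possible sketch follows. The helper names `tag_sort_key` and `sort_versions_newest_first` and the sample tag shapes (including an `_ML307` suffix) are assumptions for illustration, not part of this commit.

```python
import re
from typing import Tuple


def tag_sort_key(tag: str) -> Tuple[Tuple[int, ...], str]:
    """Splits a tag like 'v0.4.1_ML307' into ((0, 4, 1), '_ML307')."""
    match = re.match(r"v(\d+(?:\.\d+)*)(.*)", tag)
    if match is None:
        # Tags without a recognizable version sort below every numbered tag.
        return ((), tag)
    numbers = tuple(int(part) for part in match.group(1).split("."))
    return (numbers, match.group(2))


def sort_versions_newest_first(versions: list) -> list:
    """Returns the info dicts ordered by numeric version, newest first."""
    return sorted(versions, key=lambda info: tag_sort_key(info["tag"]), reverse=True)
```

With this key, the numeric parts are compared as integers and the suffix only breaks ties between otherwise identical versions.
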
--- a/websocket.md
+++ b/websocket.md
@ -1,160 +0,0 @@
-
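-# AI Voice Interaction Communication Protocol
-
-## 1. Connection Establishment and Authentication
-
-When connecting to the server over WebSocket, the client must include the following HTTP headers:
-
- `Authorization`: Bearer token, in the form "Bearer <access_token>"
- `Device-Id`: the device MAC address
- `Protocol-Version`: the protocol version number, currently 2
-
-WebSocket URL: `wss://api.tenclass.net/xiaozhi/v1`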
-
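-## 2. Binary Data
-
-Binary data sent by the client uses a protocol with a fixed header format, as follows:
-
-```cpp
-struct BinaryProtocol {
-    uint16_t version;        // Binary protocol version, currently 2
-    uint16_t type;           // Message type (0: audio stream data, 1: JSON)
-    uint32_t reserved;       // Reserved field
-    uint32_t timestamp;      // Timestamp (reserved for echo cancellation; can also be used for ordering over unreliable UDP transport)
-    uint32_t payload_size;   // Payload size
-    uint8_t payload[];       // Audio data (Opus-encoded or in the negotiated audio format), or an encapsulated JSON message
-} __attribute__((packed));
-```
-
-Note: all multi-byte integer fields use network byte order (big-endian).
-
-Currently, binary data and JSON share the same WebSocket connection. In a future real-time conversation mode, binary audio data may be carried over UDP, which can be negotiated by extending the hello message.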
-
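-## 3. Audio Data Transmission
-
- Client to server: Opus-encoded audio data sent using the binary protocol
- Server to client: Opus-encoded audio data sent using the binary protocol, in the same format the client sends
-
-An audio packet with a payload_size of 0 may be used as a sentence boundary marker; it can be ignored, but must not be treated as an error.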
-
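-## 4. Handshake Message
-
-After the connection is established, the client sends a JSON "hello" message to initialize the server-side audio decoder.
-It does not need to wait for a server response and can start sending audio data immediately afterwards.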
-
-```json
-{
-  "type": "hello",
-  "response_mode": "auto",
-  "audio_params": {
-    "format": "opus",
-    "sample_rate": 16000,
-    "channels": 1
-  }
-}
-```
-
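-The response mode `response_mode` can be `auto` or `manual`.
-
-`auto`: automatic response mode; the server runs VAD on the audio in real time and decides automatically when to start responding.
-
-`manual`: manual response mode; the server may respond when the client state changes from `listening` to `idle`.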
-
-## 5. State Updates
-
-The client sends a JSON message whenever its state changes:
-
-```json
-{
-  "type": "state",
-  "state": "<new state>"
-}
-```
-
-Possible state values: `idle`, `wake_word_detected`, `listening`, `speaking`.
-
-Examples:
-
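-1. Push-to-talk (`response_mode` is `manual`)
-
- When the talk button is pressed and the server is not connected, the client connects to the server while encoding and buffering the current audio. Once connected, the client sets its state to `listening` and sends the buffered audio after the hello message.
- When the talk button is pressed and the server is already connected, the client sets its state to `listening` and sends audio data.
- When the talk button is released, the state changes to `idle` and the server starts recognition.
- When the server starts responding, it pushes `stt` and `tts` messages.
- When the client starts playing audio, it sets its state to `speaking`.
- When the client finishes playing audio, it sets its state to `idle`.
- In the `speaking` state, pressing the talk button immediately stops the current playback and changes the state to `listening`.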
-
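-2. Wake word, turn-based conversation (`response_mode` is `auto`)
-
- The client connects to the server, sends the hello message, sends the wake-word audio data, and then sends the state `wake_word_detected`; the server starts responding.
- When the client starts playing audio, it sets its state to `speaking`; the client does not send audio data in this state.
- When the client finishes playing audio, it sets its state to `listening` and sends audio data.
- When the server, based on VAD, decides to start responding, it pushes `stt` and `tts` messages.
- On receiving `tts`.`start`, the client starts playing audio and sets its state to `speaking`.
- On receiving `tts`.`stop`, the client stops playing audio and sets its state to `listening`.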
-
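-3. Wake word, real-time conversation (`response_mode` is `real_time`)
-
- The client connects to the server, sends the hello message, sends the wake-word audio data, and then sends the state `wake_word_detected`; the server starts responding.
- When the client starts playing audio, it sets its state to `speaking`.
- When the client finishes playing audio, it sets its state to `listening`.
- The client sends audio data in both the `speaking` and `listening` states.
- When the server, based on VAD, decides to start responding, it pushes `stt` and `tts` messages.
- On receiving `stt`, the client sets its state to `listening`. If audio is currently playing, playback stops after the current sentence ends.
- On receiving `tts`.`start`, the client starts playing audio and sets its state to `speaking`.
- On receiving `tts`.`stop`, the client stops playing audio and sets its state to `listening`.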
-
-## 6. Server-to-Client Messages
-
-### 6.1 Speech Recognition Result (STT)
-
-```json
-{
-  "type": "stt",
-  "text": "<recognized text>"
-}
-```
-
-### 6.2 Text-to-Speech (TTS)
-
-TTS start:
-```json
-{
-  "type": "tts",
-  "state": "start",
-  "sample_rate": 24000
-}
-```
-
-Sentence start:
-```json
-{
-  "type": "tts",
-  "state": "sentence_start",
-  "text": "What are you doing?"
-}
-```
-
-Sentence end:
-```json
-{
-  "type": "tts",
-  "state": "sentence_end"
-}
-```
-
-TTS end:
-```json
-{
-  "type": "tts",
-  "state": "stop"
-}
-```
-
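-## 7. Connection Management
-
- When the client detects that the WebSocket connection has dropped, it should stop audio playback and reset to the idle state
- After a disconnect, the client reconnects on demand (for example on a button press or wake word)
-
-This document summarizes the main aspects of the WebSocket communication protocol.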
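
The fixed binary header described in the removed websocket.md (a packed struct with all multi-byte fields in network byte order) can be sketched in Python with the standard `struct` module. The names `HEADER`, `TYPE_AUDIO`, `TYPE_JSON`, `pack_message`, and `unpack_message` are hypothetical and chosen for illustration; only the field layout, the version value 2, and the type values 0 and 1 come from the document.

```python
import struct
from typing import Tuple

# version (u16), type (u16), reserved (u32), timestamp (u32), payload_size (u32),
# all in network byte order (">"), matching the packed BinaryProtocol struct.
HEADER = struct.Struct(">HHIII")

TYPE_AUDIO = 0  # audio stream data
TYPE_JSON = 1   # JSON message


def pack_message(msg_type: int, payload: bytes, timestamp: int = 0,
                 version: int = 2) -> bytes:
    """Prepends the 16-byte fixed header to the payload."""
    return HEADER.pack(version, msg_type, 0, timestamp, len(payload)) + payload


def unpack_message(data: bytes) -> Tuple[int, int, int, bytes]:
    """Returns (version, type, timestamp, payload); raises ValueError if truncated."""
    if len(data) < HEADER.size:
        raise ValueError("message shorter than header")
    version, msg_type, _reserved, timestamp, payload_size = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + payload_size]
    if len(payload) != payload_size:
        raise ValueError("payload truncated")
    return version, msg_type, timestamp, payload
```

A zero-length payload round-trips cleanly here, which matches the note that an audio packet with `payload_size` 0 is a sentence boundary marker rather than an error.
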