Merged
Changes from 13 commits
35 commits
5cb5189
Introduce cppjieba as a submodule for Chinese word segmentation
CrazySteve0605 Jul 23, 2025
fb4efef
Update what's new
CrazySteve0605 Jul 24, 2025
ae58e9b
Add comments for building script of cppjieba and its dependency
CrazySteve0605 Jul 24, 2025
06070c1
Update projectDocs/dev/createDevEnvironment.md
CrazySteve0605 Aug 1, 2025
2273a60
Update include/readme.md
CrazySteve0605 Aug 1, 2025
3d4d9f1
Remove changes in sconscript for localLib
CrazySteve0605 Aug 1, 2025
1fbf05f
add building script for cppjieba
CrazySteve0605 Aug 4, 2025
7de7464
add JiebaSingleton wrapper and C API for NVDA segmentation
CrazySteve0605 Aug 5, 2025
f4cab8a
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Aug 5, 2025
d4c3a92
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Aug 6, 2025
0d92c08
Update GitHub action workflow to fetch cppjieba's submodule
CrazySteve0605 Aug 6, 2025
da662be
Update .gitignore for cppjieba
CrazySteve0605 Aug 6, 2025
38a12dc
Update building and setup script for cppjieba's dicts installation
CrazySteve0605 Aug 6, 2025
c60c2da
update copyright headers based on @seanbudd's suggestions
CrazySteve0605 Aug 9, 2025
c853b64
Update include/readme.md
CrazySteve0605 Aug 9, 2025
a643391
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Aug 28, 2025
3e495d2
add wrappers for user dict management
CrazySteve0605 Aug 28, 2025
11827fb
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 4, 2025
1869ed0
simplify the initialization of cppjieba
CrazySteve0605 Sep 4, 2025
fe118ee
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 6, 2025
49cc1fe
update cppjieba to the latest commit
CrazySteve0605 Sep 6, 2025
a9281f6
update .gitattributes for .hpp header files
CrazySteve0605 Sep 6, 2025
2955ca8
simplify helper of `cppjieba`
CrazySteve0605 Sep 7, 2025
53158b6
update changelog
CrazySteve0605 Sep 7, 2025
30120f8
update building script
CrazySteve0605 Sep 7, 2025
00796fe
revert installing script
CrazySteve0605 Sep 7, 2025
09b1890
fix building script
CrazySteve0605 Sep 7, 2025
bec5dc5
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 9, 2025
b3e08ee
update helper of `cppjieba`
CrazySteve0605 Sep 9, 2025
0f507d5
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 20, 2025
c2cbb24
Revert "Update projectDocs/dev/createDevEnvironment.md"
CrazySteve0605 Sep 20, 2025
194a69e
avoid using compilation time path
CrazySteve0605 Sep 20, 2025
2e730d6
Update .gitattributes
CrazySteve0605 Sep 20, 2025
30e855f
Merge branch 'try-chineseWordSegmentation-staging' into integrateCPPJ…
michaelDCurran Sep 22, 2025
b327e23
Merge branch 'try-chineseWordSegmentation-staging' into integrateCPPJ…
michaelDCurran Sep 28, 2025
2 changes: 1 addition & 1 deletion .github/workflows/testAndPublish.yml
@@ -53,7 +53,7 @@ jobs:
 - name: Checkout NVDA
   uses: actions/checkout@v4
   with:
-    submodules: true
+    submodules: recursive
 - name: Install Python
   uses: actions/setup-python@v5
   with:
1 change: 1 addition & 0 deletions .gitignore
@@ -14,6 +14,7 @@ source/lib
source/lib64
source/typelibs
source/louis
source/cppjieba
*.obj
*.exp
*.lib
3 changes: 3 additions & 0 deletions .gitmodules
@@ -42,3 +42,6 @@
[submodule "include/nvda-mathcat"]
path = include/nvda-mathcat
url = https://github.com/nvaccess/nvda-mathcat.git
[submodule "include/cppjieba"]
path = include/cppjieba
url = https://github.com/yanyiwu/cppjieba
1 change: 1 addition & 0 deletions copying.txt
@@ -356,6 +356,7 @@ In addition to these dependencies, the following are also included in NVDA:
- Microsoft Detours: MIT
- Python: PSF
- NSIS: zlib/libpng
- cppjieba: MIT

Furthermore, NVDA also utilises some static/binary dependencies, details of which can be found at the following URL:

1 change: 1 addition & 0 deletions include/cppjieba
Submodule cppjieba added at 9b4090
7 changes: 7 additions & 0 deletions include/readme.md
@@ -49,3 +49,10 @@ Used in chrome system tests.
https://github.com/microsoft/wil/

Fetch latest from master.

### cppjieba

[cppjieba](https://github.com/yanyiwu/cppjieba)

Fetch latest from master.
Used for Chinese text segmentation.
6 changes: 5 additions & 1 deletion nvdaHelper/archBuild_sconscript
@@ -1,5 +1,5 @@
# A part of NonVisual Desktop Access (NVDA)
-# Copyright (C) 2006-2023 NV Access Limited
+# Copyright (C) 2006-2025 NV Access Limited
# This file may be used under the terms of the GNU General Public License, version 2 or later.
# For more details see: https://www.gnu.org/licenses/gpl-2.0.html

@@ -209,6 +209,10 @@ Export("detoursLib")
apiHookObj = env.Object("apiHook", "common/apiHook.cpp")
Export("apiHookObj")

cppjiebaLib = env.SConscript("cppjieba/sconscript")
Export("cppjiebaLib")
env.Install(libInstallDir, cppjiebaLib)

if TARGET_ARCH == "x86":
localLib = env.SConscript("local/sconscript")
Export("localLib")
85 changes: 85 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.cpp
@@ -0,0 +1,85 @@
/*
This file is a part of the NVDA project.
URL: http://www.nvda-project.org/
Copyright 2025 NV Access Limited, Wang Chong.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2.0, as published by
the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This license can be found at:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

#include "cppjieba.hpp"

JiebaSingleton& JiebaSingleton::getInstance() {
	// C++11 guarantees thread-safe init of this local static
	static JiebaSingleton instance;
	return instance;
}

JiebaSingleton::JiebaSingleton(): cppjieba::Jieba() { } // call base ctor to load dictionaries, models, etc.

void JiebaSingleton::getOffsets(const std::string& text, std::vector<int>& charOffsets) {
	std::lock_guard<std::mutex> lock(segMutex);
	std::vector<std::string> words;
	this->Cut(text, words, true);

	int cumulative = 0;
	for (auto const& w : words) {
		int wc = 0;
		auto ptr = reinterpret_cast<const unsigned char*>(w.c_str());
		size_t i = 0, len = w.size();
		while (i < len) {
			unsigned char c = ptr[i];
			if ((c & 0x80) == 0) i += 1;
			else if ((c & 0xE0) == 0xC0) i += 2;
			else if ((c & 0xF0) == 0xE0) i += 3;
			else if ((c & 0xF8) == 0xF0) i += 4;
			else i += 1;
			++wc;
		}
		cumulative += wc;
		charOffsets.push_back(cumulative);
	}
}

extern "C" {

int initJieba() {
	try {
		// simply force the singleton into existence
		(void)JiebaSingleton::getInstance();
		return 0;
	} catch (...) {
		return -1;
	}
}

int segmentOffsets(const char* text, int** charOffsets, int* outLen) {
	if (!text || !charOffsets || !outLen) return -1;
	// we assume initJieba() has already been called successfully

	std::string input(text);
	std::vector<int> offs;
	JiebaSingleton::getInstance().getOffsets(input, offs);

	int n = static_cast<int>(offs.size());
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * n));
	if (!buf) {
		*outLen = 0;
		return -1;
	}
	for (int i = 0; i < n; ++i) buf[i] = offs[i];
	*charOffsets = buf;
	*outLen = n;
	return 0;
}

void freeOffsets(int* ptr) {
	if (ptr) free(ptr);
}

} // extern "C"
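
Reviewer note: the inner loop of `getOffsets` counts Unicode code points by looking at each UTF-8 lead byte, then reports a cumulative end offset per word. The sketch below isolates that counting so it can be checked without cppjieba. `countCodePoints` and `endOffsets` are hypothetical names used only for illustration (they are not part of this PR), and the word list is supplied by hand instead of coming from `Cut()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the PR): counts code points in a UTF-8 string
// using the same lead-byte rules as JiebaSingleton::getOffsets().
int countCodePoints(const std::string& s) {
	int count = 0;
	std::size_t i = 0;
	while (i < s.size()) {
		const unsigned char c = static_cast<unsigned char>(s[i]);
		if ((c & 0x80) == 0x00) i += 1;       // 0xxxxxxx: ASCII
		else if ((c & 0xE0) == 0xC0) i += 2;  // 110xxxxx: 2-byte sequence
		else if ((c & 0xF0) == 0xE0) i += 3;  // 1110xxxx: 3-byte sequence (most CJK)
		else if ((c & 0xF8) == 0xF0) i += 4;  // 11110xxx: 4-byte sequence
		else i += 1;                          // stray or invalid byte: skip one byte
		++count;
	}
	return count;
}

// Hypothetical helper: cumulative end offsets (in code points) for words
// that are assumed to be already segmented.
std::vector<int> endOffsets(const std::vector<std::string>& words) {
	std::vector<int> out;
	int cumulative = 0;
	for (const auto& w : words) {
		cumulative += countCodePoints(w);
		out.push_back(cumulative);
	}
	return out;
}
```

For the hand-segmented words `我`, `爱`, `NVDA`, `endOffsets` yields 1, 2, 6. The offsets count code points, not bytes, so a caller indexing a UTF-16 buffer would need its own conversion for characters outside the Basic Multilingual Plane.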
5 changes: 5 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.def
@@ -0,0 +1,5 @@
LIBRARY cppjieba
EXPORTS
initJieba
segmentOffsets
freeOffsets
71 changes: 71 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.hpp
@@ -0,0 +1,71 @@
/*
This file is a part of the NVDA project.
URL: http://www.nvda-project.org/
Copyright 2025 NV Access Limited, Wang Chong.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2.0, as published by
the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This license can be found at:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

#ifndef CPPJIEBA_DLL_H
#define CPPJIEBA_DLL_H
#pragma once

#include <vector>
#include <string>
#include <mutex>
#include <cstdlib>
#include "Jieba.hpp"

#ifdef _WIN32
# define JIEBA_API __declspec(dllexport)
#else
# define JIEBA_API
#endif

using namespace std;

/// @brief Singleton wrapper around cppjieba::Jieba.
class JiebaSingleton : public cppjieba::Jieba {
public:
	/// @brief Returns the single instance, constructing on first call.
	static JiebaSingleton& getInstance();

	/// @brief Do thread-safe segmentation and compute character end offsets.
	/// @param text The input text in UTF-8 encoding.
	/// @param charOffsets Output vector to hold character offsets.
	void getOffsets(const string& text, vector<int>& charOffsets);

private:
	JiebaSingleton(); ///< private ctor initializes base Jieba

	/// Disable copy and move
	JiebaSingleton(const JiebaSingleton&) = delete;
	JiebaSingleton& operator = (const JiebaSingleton&) = delete;
	JiebaSingleton(JiebaSingleton&&) = delete;
	JiebaSingleton& operator = (JiebaSingleton&&) = delete;

	std::mutex segMutex; ///< guards concurrent Cut() calls
};

extern "C" {

/// @brief Force singleton construction (load dicts, etc.) before any segmentation.
/// @return 0 on success, -1 on failure.
JIEBA_API int initJieba();

/// @brief Segment UTF-8 text into character offsets.
/// @return 0 on success, -1 on failure.
JIEBA_API int segmentOffsets(const char* text, int** charOffsets, int* outLen);

/// @brief Free memory allocated by segmentOffsets.
JIEBA_API void freeOffsets(int* ptr);

} // extern "C"

#endif // CPPJIEBA_DLL_H
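
Reviewer note: the C API hands ownership of a `malloc`-allocated buffer to the caller, who must release it with `freeOffsets`. The sketch below shows one way a C++ caller might consume that contract. To keep it self-contained, `segmentOffsets` and `freeOffsets` here are stand-ins with the same signatures as the declarations above: the stand-in splits ASCII text at spaces and does not call cppjieba. `wordRanges` is a hypothetical helper, not part of this PR.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Stand-in (assumption): same signature as the PR's segmentOffsets(), but it
// ends a "word" after every space and at the end of the text.
extern "C" int segmentOffsets(const char* text, int** charOffsets, int* outLen) {
	if (!text || !charOffsets || !outLen) return -1;
	std::vector<int> offs;
	const int len = static_cast<int>(std::strlen(text));
	for (int i = 0; i < len; ++i) {
		if (text[i] == ' ' || i == len - 1) offs.push_back(i + 1);
	}
	const int n = static_cast<int>(offs.size());
	// Allocate at least one element so an empty result is not mistaken for failure.
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * (n > 0 ? n : 1)));
	if (!buf) {
		*charOffsets = nullptr;
		*outLen = 0;
		return -1;
	}
	for (int i = 0; i < n; ++i) buf[i] = offs[i];
	*charOffsets = buf;
	*outLen = n;
	return 0;
}

// Stand-in (assumption): same signature as the PR's freeOffsets().
extern "C" void freeOffsets(int* ptr) {
	std::free(ptr);
}

// Hypothetical caller-side helper: converts end offsets into [start, end)
// ranges and releases the buffer through freeOffsets() on every path.
std::vector<std::pair<int, int>> wordRanges(const char* text) {
	std::vector<std::pair<int, int>> ranges;
	int* raw = nullptr;
	int count = 0;
	if (segmentOffsets(text, &raw, &count) != 0) return ranges;
	std::unique_ptr<int, decltype(&freeOffsets)> offsets(raw, &freeOffsets);
	int start = 0;
	for (int i = 0; i < count; ++i) {
		ranges.emplace_back(start, offsets.get()[i]);
		start = offsets.get()[i];
	}
	return ranges;
}
```

Wrapping the raw pointer in a `std::unique_ptr` with `freeOffsets` as the deleter keeps allocation and release on the same side of the DLL boundary, even if building the result vector throws.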
55 changes: 55 additions & 0 deletions nvdaHelper/cppjieba/sconscript
@@ -0,0 +1,55 @@
# A part of NonVisual Desktop Access (NVDA)
# Copyright (C) 2025 NV Access Limited, Wang Chong.
# This file may be used under the terms of the GNU General Public License, version 2 or later.
# For more details see: https://www.gnu.org/licenses/gpl-2.0.html

import typing # noqa: E402
import os

Import(
	[
		"thirdPartyEnv",
		"sourceDir",
	]
)
thirdPartyEnv: Environment = thirdPartyEnv
env: Environment = typing.cast(Environment, thirdPartyEnv.Clone())

cppjiebaPath = Dir("#include/cppjieba")
cppjiebaSrcPath = cppjiebaPath.Dir("include/cppjieba")
cppjiebaDictPath = cppjiebaPath.Dir("dict")
outDir = sourceDir.Dir("cppjieba")
unitTestDictsDir = env.Dir("#tests/unit/cppjiebaDicts")
LimonpPath = cppjiebaPath.Dir("deps/limonp") # cppjieba's dependency
LimonpSrcPath = LimonpPath.Dir("include/limonp")

env.Prepend(
	CPPPATH=[
		cppjiebaSrcPath,
		LimonpSrcPath.Dir(".."),
	]
)

sourceFiles = [
	"cppjieba.cpp",
	"cppjieba.def",
]

cppjiebaLib = env.SharedLibrary(target="cppjieba", source=sourceFiles)

if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # ensure the dicts are installed only once and avoid an SCons warning
	env.Install(
		outDir.Dir("dicts"),
		[
			f
			for f in env.Glob(f"{cppjiebaDictPath}/*")
			if f.name
			not in (
				"README.md",
				"pos_dict",
			)
			and not f.name.endswith(".in")
		],
	)

Return("cppjiebaLib")
2 changes: 1 addition & 1 deletion projectDocs/dev/createDevEnvironment.md
@@ -98,7 +98,7 @@ If you aren't sure, run `git submodule update` after every git pull, merge or ch
* [Nullsoft Install System](https://nsis.sourceforge.io), version 3.11
* [Java Access Bridge 32 bit, from Zulu Community OpenJDK build 17.0.9+8Zulu (17.46.19)](https://github.com/nvaccess/javaAccessBridge32-bin)
* [Windows Implementation Libraries (WIL)](https://github.com/microsoft/wil/)
* [NVDA DiffMatchPatch](https://github.com/codeofdusk/nvda_dmp)
* [cppjieba - Chinese word segmentation](https://github.com/yanyiwu/cppjieba), commit `9b40903ed6cbd795367ea64f9a7d3f3bc4aa4714`

#### Build time dependencies

1 change: 1 addition & 0 deletions source/setup.py
@@ -263,6 +263,7 @@ def _genManifestTemplate(shouldHaveUIAccess: bool) -> tuple[int, int, bytes]:
("images", glob("images/*.ico")),
("fonts", glob("fonts/*.ttf")),
("louis/tables", glob("louis/tables/*")),
("cppjieba/dicts", glob("cppjieba/dicts/*")),
("COMRegistrationFixes", glob("COMRegistrationFixes/*.reg")),
("miscDeps/tools", ["../miscDeps/tools/msgfmt.exe"]),
(".", glob("../miscDeps/python/*.dll")),
2 changes: 1 addition & 1 deletion user_docs/en/changes.md
@@ -46,7 +46,7 @@ Please refer to [the developer guide](https://download.nvaccess.org/documentatio
* Updated `include` dependencies:
* detours to `9764cebcb1a75940e68fa83d6730ffaf0f669401`. (#18447, @LeonarddeR)
* The `nvda_dmp` utility has been removed. (#18480, @codeofdusk)
* `comInterfaces_sconscript` has been updated to make the generated files in `comInterfaces` work better with IDEs. (#17608, @gexgd0419)
* Added [cppjieba](https://github.com/yanyiwu/cppjieba) as a git submodule for word segmentation. (#18548, @CrazySteve0605)

#### Deprecations
