Merged
35 commits
5cb5189
Introduce cppjieba as a submodule for Chinese word segmentation
CrazySteve0605 Jul 23, 2025
fb4efef
Update what's new
CrazySteve0605 Jul 24, 2025
ae58e9b
Add comments for building script of cppjieba and its dependency
CrazySteve0605 Jul 24, 2025
06070c1
Update projectDocs/dev/createDevEnvironment.md
CrazySteve0605 Aug 1, 2025
2273a60
Update include/readme.md
CrazySteve0605 Aug 1, 2025
3d4d9f1
Remove changes in sconscript for localLIb
CrazySteve0605 Aug 1, 2025
1fbf05f
add building script for cppjieba
CrazySteve0605 Aug 4, 2025
7de7464
add JiebaSingleton wrapper and C API for NVDA segmentation
CrazySteve0605 Aug 5, 2025
f4cab8a
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Aug 5, 2025
d4c3a92
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Aug 6, 2025
0d92c08
Update GitHub action workflow to fetch cppjieba's submodule
CrazySteve0605 Aug 6, 2025
da662be
Update .gitignore for cppjieba
CrazySteve0605 Aug 6, 2025
38a12dc
Update building and setup script for cppjieba's dicts installation
CrazySteve0605 Aug 6, 2025
c60c2da
update copyright headers based on @seanbudd's suggestions
CrazySteve0605 Aug 9, 2025
c853b64
Update include/readme.md
CrazySteve0605 Aug 9, 2025
a643391
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Aug 28, 2025
3e495d2
add wrappers for user dict management
CrazySteve0605 Aug 28, 2025
11827fb
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 4, 2025
1869ed0
simplify the initialization of cppjieba
CrazySteve0605 Sep 4, 2025
fe118ee
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 6, 2025
49cc1fe
update cppjieba to the latest commit
CrazySteve0605 Sep 6, 2025
a9281f6
update .gitattributes for .hpp header files
CrazySteve0605 Sep 6, 2025
2955ca8
simplify helper of `cppjieba`
CrazySteve0605 Sep 7, 2025
53158b6
update changelog
CrazySteve0605 Sep 7, 2025
30120f8
update building script
CrazySteve0605 Sep 7, 2025
00796fe
revert installing script
CrazySteve0605 Sep 7, 2025
09b1890
fix building script
CrazySteve0605 Sep 7, 2025
bec5dc5
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 9, 2025
b3e08ee
update helper of `coojieba`
CrazySteve0605 Sep 9, 2025
0f507d5
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 Sep 20, 2025
c2cbb24
Revert "Update projectDocs/dev/createDevEnvironment.md"
CrazySteve0605 Sep 20, 2025
194a69e
avoid using compilation time path
CrazySteve0605 Sep 20, 2025
2e730d6
Update .gitattributes
CrazySteve0605 Sep 20, 2025
30e855f
Merge branch 'try-chineseWordSegmentation-staging' into integrateCPPJ…
michaelDCurran Sep 22, 2025
b327e23
Merge branch 'try-chineseWordSegmentation-staging' into integrateCPPJ…
michaelDCurran Sep 28, 2025
1 change: 1 addition & 0 deletions .gitattributes
@@ -50,6 +50,7 @@ sconstruct text diff=python
*.c text diff=c
*.cpp text diff=cpp
*.h text diff=c
*.hpp text diff=cpp
*.idl text diff=c
*.acf text diff=c

2 changes: 1 addition & 1 deletion .github/workflows/testAndPublish.yml
@@ -77,7 +77,7 @@ jobs:
- name: Checkout NVDA
uses: actions/checkout@v5
with:
-submodules: true
+submodules: recursive
- name: Install Python
uses: actions/setup-python@v5
with:
1 change: 1 addition & 0 deletions .gitignore
@@ -14,6 +14,7 @@ source/lib
source/lib64
source/typelibs
source/louis
source/cppjieba
*.obj
*.exp
*.lib
3 changes: 3 additions & 0 deletions .gitmodules
@@ -42,3 +42,6 @@
[submodule "include/nvda-mathcat"]
path = include/nvda-mathcat
url = https://github.com/nvaccess/nvda-mathcat.git
[submodule "include/cppjieba"]
path = include/cppjieba
url = https://github.com/yanyiwu/cppjieba
1 change: 1 addition & 0 deletions copying.txt
@@ -348,6 +348,7 @@ In addition to these dependencies, the following are also included in NVDA:
* Microsoft Detours: MIT
* Python: PSF
* NSIS: zlib/libpng
* cppjieba: MIT

Furthermore, NVDA also utilises some static/binary dependencies, details of which can be found at the following URL:

1 change: 1 addition & 0 deletions include/cppjieba
Submodule cppjieba added at 9408c1
7 changes: 7 additions & 0 deletions include/readme.md
@@ -61,3 +61,10 @@ Used in chrome system tests.
<https://github.com/microsoft/wil/>

Fetch latest from master.

### cppjieba

[cppjieba](https://github.com/yanyiwu/cppjieba).

Fetch latest from master.
Used for Chinese text segmentation.
4 changes: 4 additions & 0 deletions nvdaHelper/archBuild_sconscript
@@ -229,6 +229,10 @@ Export("detoursLib")
apiHookObj = env.Object("apiHook", "common/apiHook.cpp")
Export("apiHookObj")

cppjiebaLib = env.SConscript("cppjieba/sconscript")
Export("cppjiebaLib")
env.Install(libInstallDir, cppjiebaLib)

if isNVDAHelperLocalArch:
localLib = env.SConscript("local/sconscript")
Export("localLib")
124 changes: 124 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.cpp
@@ -0,0 +1,124 @@
/*
A part of NonVisual Desktop Access (NVDA)
Copyright (C) 2025 NV Access Limited, Wang Chong
This file may be used under the terms of the GNU General Public License, version 2 or later, as modified by the NVDA license.
For full terms and any additional permissions, see the NVDA license file: https://github.com/nvaccess/nvda/blob/master/copying.txt
*/

#include "cppjieba.hpp"


using namespace std;

// static members for singleton bookkeeping
JiebaSingleton* JiebaSingleton::instance = nullptr;
std::once_flag JiebaSingleton::initFlag;

JiebaSingleton& JiebaSingleton::getInstance(const char* dictDir) {
// convert the incoming C string to std::string; a null dictDir becomes an empty path
std::string dir = dictDir ? dictDir : "";

// ensure singleton is constructed exactly once
std::call_once(JiebaSingleton::initFlag, [&]() {
// allocate on heap, so we avoid copy/move and control lifetime
JiebaSingleton::instance = new JiebaSingleton(dir.c_str());
// optional: register deleter at exit
std::atexit([]() {
delete JiebaSingleton::instance;
JiebaSingleton::instance = nullptr;
});
});

// after call_once, instance must be non-null
return *JiebaSingleton::instance;
}

JiebaSingleton& JiebaSingleton::getInstance() {
if (!JiebaSingleton::instance) {
throw std::runtime_error("JiebaSingleton::getInstance() called before initialization. Call getInstance(dictDir) or initJieba() first.");
}
return *JiebaSingleton::instance;
}
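
Review note: because construction happens inside `std::call_once`, only the first `dictDir` is ever used. A later `getInstance("other")` or `initJieba("other")` returns the existing instance without reloading any dictionaries. The sketch below illustrates that first-call-wins behaviour. It is not the PR's implementation: it swaps `std::call_once` and the heap pointer for a function-local static, and `DictHolder` is a hypothetical stand-in that only records its directory.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for JiebaSingleton: it only records the directory it was built with.
class DictHolder {
public:
	// The function-local static is constructed on the first call only
	// (initialisation is thread-safe since C++11); later calls ignore `dir`.
	static DictHolder& getInstance(const char* dir) {
		static DictHolder instance(dir ? dir : "");
		return instance;
	}

	const std::string& dir() const { return dir_; }

	// Disable copying so there is exactly one instance.
	DictHolder(const DictHolder&) = delete;
	DictHolder& operator=(const DictHolder&) = delete;

private:
	explicit DictHolder(std::string dir) : dir_(std::move(dir)) {}
	std::string dir_;
};
```

Calling `DictHolder::getInstance("dicts/a")` and then `DictHolder::getInstance("dicts/b")` yields the same object, still holding `dicts/a`. If re-initialisation with a different directory should be an error rather than a silent no-op, the wrapper would need to compare the requested directory with the stored one.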

JiebaSingleton::JiebaSingleton(const char* dictDir)
: cppjieba::JiebaSegmenter(
std::string(dictDir),
std::string(dictDir),
std::string(dictDir)
)
{
// base class ctor will load dictionaries/models
}

void JiebaSingleton::getWordEndOffsets(const std::string& text, std::vector<int>& wordEndOffsets) {
std::lock_guard<std::mutex> lock(segMutex);
wordEndOffsets.clear();
std::vector<cppjieba::Word> words;
this->Cut(text, words, true);

for (const auto& word : words) {
wordEndOffsets.push_back(word.unicode_offset + word.unicode_length);
}
}
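
Review note: `getWordEndOffsets` pushes `word.unicode_offset + word.unicode_length`, which, going by the field names, counts Unicode code points rather than UTF-8 bytes. The sketch below shows the difference for pre-segmented words. It assumes valid UTF-8 input, and both helper names are hypothetical (they do not exist in cppjieba or in this PR).

```cpp
#include <string>
#include <vector>

// Count Unicode code points in a UTF-8 string by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
int countCodePoints(const std::string& utf8) {
	int count = 0;
	for (unsigned char byte : utf8) {
		if ((byte & 0xC0) != 0x80) ++count;
	}
	return count;
}

// Given already-segmented words, return each word's end offset in code points.
std::vector<int> endOffsetsInCodePoints(const std::vector<std::string>& words) {
	std::vector<int> ends;
	ends.reserve(words.size());
	int position = 0;
	for (const auto& word : words) {
		position += countCodePoints(word);
		ends.push_back(position);
	}
	return ends;
}
```

For three words of one, one and two CJK characters (three UTF-8 bytes each), the code point end offsets are 1, 2 and 4, whereas byte offsets would be 3, 6 and 12. Callers need to know which unit they receive before indexing into their own string representation.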

extern "C" {

bool initJieba(const char* dictDir) {
try {
// simply force the singleton into existence
(void)JiebaSingleton::getInstance(dictDir);
return true;
} catch (...) {
return false;
}
}

bool calculateWordOffsets(const char* text, int** wordEndOffsets, int* outLen) {
if (!text || !wordEndOffsets || !outLen) return false;

try {
std::string textStr(text);
std::vector<int> offs;
JiebaSingleton::getInstance().getWordEndOffsets(textStr, offs);

int n = static_cast<int>(offs.size());
if (n == 0) {
*wordEndOffsets = nullptr;
*outLen = 0;
return true; // success, but no offsets
}

int* buf = static_cast<int*>(std::malloc(sizeof(int) * n));
if (!buf) {
*wordEndOffsets = nullptr;
*outLen = 0;
return false;
}
for (int i = 0; i < n; ++i) buf[i] = offs[i];
*wordEndOffsets = buf;
*outLen = n;
return true;
} catch (...) {
*wordEndOffsets = nullptr;
*outLen = 0;
return false;
}
}

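bool insertUserWord(const char* word, int freq, const char* tag = cppjieba::UNKNOWN_TAG) {
if (!word || !tag) return false;
try {
return JiebaSingleton::getInstance().InsertUserWord(string(word), freq, string(tag));
} catch (...) { return false; } // getInstance() throws if initJieba() has not run
}

bool deleteUserWord(const char* word, const char* tag = cppjieba::UNKNOWN_TAG) {
if (!word || !tag) return false;
try {
return JiebaSingleton::getInstance().DeleteUserWord(string(word), string(tag));
} catch (...) { return false; } // never let exceptions cross the C boundary
}

bool find(const char* word) {
if (!word) return false;
try {
return JiebaSingleton::getInstance().Find(string(word));
} catch (...) { return false; } // never let exceptions cross the C boundary
}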

void freeOffsets(int* ptr) {
if (ptr) std::free(ptr);
}

} // extern "C"
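
Review note: `calculateWordOffsets` hands the caller a `malloc`'d buffer that must be released with `freeOffsets`, so the allocator that created the buffer is also the one that frees it. A C++ caller could wrap that contract in RAII, as sketched below. `produceOffsets` and `releaseOffsets` are hypothetical stand-ins with the same shape as the two exported functions (the producer just emits 1..n), so the example runs without cppjieba.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Stand-in producer with the same contract as calculateWordOffsets:
// on success, *out is a malloc'd buffer, or nullptr when *outLen is 0.
bool produceOffsets(int n, int** out, int* outLen) {
	if (!out || !outLen || n < 0) return false;
	*out = nullptr;
	*outLen = 0;
	if (n == 0) return true;
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * static_cast<std::size_t>(n)));
	if (!buf) return false;
	for (int i = 0; i < n; ++i) buf[i] = i + 1;
	*out = buf;
	*outLen = n;
	return true;
}

// Stand-in for freeOffsets.
void releaseOffsets(int* ptr) {
	std::free(ptr);
}

// Deleter so that unique_ptr releases the buffer through the matching function.
struct OffsetsDeleter {
	void operator()(int* ptr) const { releaseOffsets(ptr); }
};

// Copy the C buffer into a vector; the buffer is released on every path.
std::vector<int> collectOffsets(int n) {
	int* raw = nullptr;
	int len = 0;
	if (!produceOffsets(n, &raw, &len)) return {};
	std::unique_ptr<int, OffsetsDeleter> owner(raw);
	return std::vector<int>(raw, raw + len);
}
```

The same pairing applies to non-C++ callers: whatever receives the pointer should pass it back to the library's own release function rather than to its own runtime's `free`.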
8 changes: 8 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.def
@@ -0,0 +1,8 @@
LIBRARY cppjieba
EXPORTS
initJieba
calculateWordOffsets
insertUserWord
deleteUserWord
find
freeOffsets
165 changes: 165 additions & 0 deletions nvdaHelper/cppjieba/cppjieba.hpp
@@ -0,0 +1,165 @@
/*
A part of NonVisual Desktop Access (NVDA)
Copyright (C) 2025 NV Access Limited, Wang Chong
This file may be used under the terms of the GNU General Public License, version 2 or later, as modified by the NVDA license.
For full terms and any additional permissions, see the NVDA license file: https://github.com/nvaccess/nvda/blob/master/copying.txt
*/

#ifndef CPPJIEBA_DLL_H
#define CPPJIEBA_DLL_H
#pragma once

#include <vector>
#include <string>
#include <cstring>
#include <mutex>
#include <cstdlib>
#include <set>
#include <stdexcept>
#include "QuerySegment.hpp"

using namespace std;

namespace cppjieba { // adapted from cppjieba's Jieba.hpp, without the keyword extractor, which NVDA does not use

class JiebaSegmenter {
public:
JiebaSegmenter(const string& dict_path,
const string& model_path,
const string& user_dict_path)
: dict_trie_(pathJoin(dict_path, "jieba.dict.utf8"), pathJoin(user_dict_path, "user.dict.utf8")),
model_(pathJoin(model_path, "hmm_model.utf8")),
mix_seg_(&dict_trie_, &model_) {
}
~JiebaSegmenter() {
}

void Cut(const string& sentence, vector<Word>& words, bool hmm = true) const {
mix_seg_.Cut(sentence, words, hmm);
}

bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) {
return dict_trie_.InsertUserWord(word,freq, tag);
}

bool DeleteUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
return dict_trie_.DeleteUserWord(word, tag);
}

bool Find(const string& word)
{
return dict_trie_.Find(word);
}

void ResetSeparators(const string& s) {
mix_seg_.ResetSeparators(s);
}

const DictTrie* GetDictTrie() const {
return &dict_trie_;
}

const HMMModel* GetHMMModel() const {
return &model_;
}

void LoadUserDict(const vector<string>& buf) {
dict_trie_.LoadUserDict(buf);
}

void LoadUserDict(const set<string>& buf) {
dict_trie_.LoadUserDict(buf);
}

void LoadUserDict(const string& path) {
dict_trie_.LoadUserDict(path);
}

private:
static string pathJoin(const string& dir, const string& filename) {
if (dir.empty()) {
return filename;
}

char last_char = dir[dir.length() - 1];
if (last_char == '/' || last_char == '\\') {
return dir + filename;
} else {
#ifdef _WIN32
return dir + '\\' + filename;
#else
return dir + '/' + filename;
#endif
}
}

static string getCurrentDirectory() {
string path(__FILE__);
size_t pos = path.find_last_of("/\\");
return (pos == string::npos) ? "" : path.substr(0, pos);
}

DictTrie dict_trie_;
HMMModel model_;

MixSegment mix_seg_;
}; // class JiebaSegmenter

} // namespace cppjieba
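
Review note: the behaviour of `pathJoin` at its edges is easy to miss when reading the class. The sketch below is a standalone copy of the helper for experimentation only. `kSeparator` is a name introduced here for illustration and is not in the PR.

```cpp
#include <string>

// Separator appended when the directory has no trailing slash.
// Hypothetical constant, added only to make the example testable on any platform.
#ifdef _WIN32
constexpr char kSeparator = '\\';
#else
constexpr char kSeparator = '/';
#endif

// Standalone copy of the private JiebaSegmenter::pathJoin helper.
std::string pathJoin(const std::string& dir, const std::string& filename) {
	// An empty directory yields a bare filename, which is then resolved
	// against the process's current working directory.
	if (dir.empty()) return filename;

	// A trailing '/' or '\\' is reused as-is; no second separator is added.
	const char last = dir.back();
	if (last == '/' || last == '\\') return dir + filename;

	return dir + kSeparator + filename;
}
```

One consequence is that an empty `dictDir` does not fail inside `pathJoin`; any failure would surface later, when the dictionary files are looked up relative to the working directory.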


/// @brief Singleton wrapper around cppjieba::JiebaSegmenter.
class JiebaSingleton : public cppjieba::JiebaSegmenter {
public:
/// @brief Returns the single instance, constructing on first call.
static JiebaSingleton& getInstance(const char* dictDir);

static JiebaSingleton& getInstance();

/// @brief Do thread-safe segmentation and compute word end offsets.
/// @param text The input text in UTF-8 encoding.
/// @param wordEndOffsets Output vector to hold the end offset of each word, in Unicode code points.
void getWordEndOffsets(const std::string& text, std::vector<int>& wordEndOffsets);

// singleton bookkeeping
static JiebaSingleton* instance;
static std::once_flag initFlag;

private:
JiebaSingleton(const char* dictDir); ///< private ctor initializes base JiebaSegmenter

/// Disable copy and move
JiebaSingleton(const JiebaSingleton&) = delete;
JiebaSingleton& operator = (const JiebaSingleton&) = delete;
JiebaSingleton(JiebaSingleton&&) = delete;
JiebaSingleton& operator = (JiebaSingleton&&) = delete;

std::mutex segMutex; ///< guards concurrent Cut() calls
};

#ifdef _WIN32
# define JIEBA_API __declspec(dllexport)
#else
# define JIEBA_API
#endif

extern "C" {

/// @brief Force singleton construction (load dicts, etc.) before any segmentation.
JIEBA_API bool initJieba(const char* dictDir);

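/// @brief Segment UTF-8 text and report the end offset of each word.
/// On success, *wordEndOffsets is a malloc'd array of *outLen ints (nullptr when *outLen is 0); release it with freeOffsets.
/// @return true on success, false on failure.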
JIEBA_API bool calculateWordOffsets(const char* text, int** wordEndOffsets, int* outLen);

/// Wrapper for word management
JIEBA_API bool insertUserWord(const char* word, int freq, const char* tag);
JIEBA_API bool deleteUserWord(const char* word, const char* tag);
JIEBA_API bool find(const char* word);

/// @brief Free memory allocated by calculateWordOffsets.
JIEBA_API void freeOffsets(int* ptr);

} // extern "C"

#endif // CPPJIEBA_DLL_H
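
Review note: every function in this C API reports failure through its `bool` return value, and C++ exceptions must not propagate through an `extern "C"` boundary. `initJieba` and `calculateWordOffsets` handle this with `try`/`catch (...)`. One way to factor that pattern out is sketched below. `guardedCall`, `lookup` and `findGuarded` are hypothetical names used only for illustration and are not part of the PR.

```cpp
#include <stdexcept>
#include <string>

// Run `fn` and translate any C++ exception into `false`,
// so nothing propagates out to a C caller.
template <typename Fn>
bool guardedCall(Fn&& fn) noexcept {
	try {
		return fn();
	} catch (...) {
		return false;
	}
}

// Stand-in for a lookup that throws when the library is not initialised.
bool lookup(const char* word, bool initialised) {
	if (!initialised) throw std::runtime_error("not initialised");
	return std::string(word) == "nvda";
}

// Wrapper with the shape of an exported C function: null check first, then guarded call.
extern "C" bool findGuarded(const char* word, bool initialised) {
	if (!word) return false;
	return guardedCall([&] { return lookup(word, initialised); });
}
```

With this shape, a call made before initialisation returns `false` instead of terminating the host process. The trade-off is that "word not found" and "library not ready" become indistinguishable to the caller unless a separate status channel is added.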