Introduce cppjieba as a submodule for Chinese word segmentation #18548
Merged
michaelDCurran
merged 35 commits into
nvaccess:try-chineseWordSegmentation-staging
from
CrazySteve0605:integrateCPPJieba
Sep 29, 2025
5cb5189
Introduce cppjieba as a submodule for Chinese word segmentation
CrazySteve0605 fb4efef
Update what's new
CrazySteve0605 ae58e9b
Add comments for building script of cppjieba and its dependency
CrazySteve0605 06070c1
Update projectDocs/dev/createDevEnvironment.md
CrazySteve0605 2273a60
Update include/readme.md
CrazySteve0605 3d4d9f1
Remove changes in sconscript for localLIb
CrazySteve0605 1fbf05f
add building script for cppjieba
CrazySteve0605 7de7464
add JiebaSingleton wrapper and C API for NVDA segmentation
CrazySteve0605 f4cab8a
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 d4c3a92
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 0d92c08
Update GitHub action workflow to fetch cppjieba's submodule
CrazySteve0605 da662be
Update .gitignore for cppjieba
CrazySteve0605 38a12dc
Update building and setup script for cppjieba's dicts installation
CrazySteve0605 c60c2da
update copyright headers based on @seanbudd's suggestions
CrazySteve0605 c853b64
Update include/readme.md
CrazySteve0605 a643391
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 3e495d2
add wrappers for user dict management
CrazySteve0605 11827fb
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 1869ed0
simplify the initialization of cppjieba
CrazySteve0605 fe118ee
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 49cc1fe
update cppjieba to the latest commit
CrazySteve0605 a9281f6
update .gitattributes for .hpp header files
CrazySteve0605 2955ca8
simplify helper of `cppjieba`
CrazySteve0605 53158b6
update changelog
CrazySteve0605 30120f8
update building script
CrazySteve0605 00796fe
revert installing script
CrazySteve0605 09b1890
fix building script
CrazySteve0605 bec5dc5
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 b3e08ee
update helper of `coojieba`
CrazySteve0605 0f507d5
Merge branch 'master' into integrateCPPJieba
CrazySteve0605 c2cbb24
Revert "Update projectDocs/dev/createDevEnvironment.md"
CrazySteve0605 194a69e
avoid using compilation time path
CrazySteve0605 2e730d6
Update .gitattributes
CrazySteve0605 30e855f
Merge branch 'try-chineseWordSegmentation-staging' into integrateCPPJ…
michaelDCurran b327e23
Merge branch 'try-chineseWordSegmentation-staging' into integrateCPPJ…
michaelDCurran
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ source/lib | |
| source/lib64 | ||
| source/typelibs | ||
| source/louis | ||
| source/cppjieba | ||
| *.obj | ||
| *.exp | ||
| *.lib | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| /* | ||
| A part of NonVisual Desktop Access (NVDA) | ||
| Copyright (C) 2025 NV Access Limited, Wang Chong | ||
| This file may be used under the terms of the GNU General Public License, version 2 or later, as modified by the NVDA license. | ||
| For full terms and any additional permissions, see the NVDA license file: https://github.com/nvaccess/nvda/blob/master/copying.txt | ||
| */ | ||
|
|
||
| #include "cppjieba.hpp" | ||
|
|
||
|
|
||
| using namespace std; | ||
|
|
||
| // static members for singleton bookkeeping | ||
| JiebaSingleton* JiebaSingleton::instance = nullptr; | ||
| std::once_flag JiebaSingleton::initFlag; | ||
|
|
||
| JiebaSingleton& JiebaSingleton::getInstance(const char* dictDir) { | ||
| // convert the incoming C string to std::string, treating a null dictDir as an empty path | ||
| std::string dir = dictDir ? dictDir : ""; | ||
|
|
||
| // ensure singleton is constructed exactly once | ||
| std::call_once(JiebaSingleton::initFlag, [&]() { | ||
| // allocate on heap, so we avoid copy/move and control lifetime | ||
| JiebaSingleton::instance = new JiebaSingleton(dir.c_str()); | ||
| // optional: register deleter at exit | ||
| std::atexit([]() { | ||
| delete JiebaSingleton::instance; | ||
| JiebaSingleton::instance = nullptr; | ||
| }); | ||
| }); | ||
|
|
||
| // after call_once, instance must be non-null | ||
| return *JiebaSingleton::instance; | ||
| } | ||
|
|
||
| JiebaSingleton& JiebaSingleton::getInstance() { | ||
| if (!JiebaSingleton::instance) { | ||
| throw std::runtime_error("JiebaSingleton::getInstance() called before initialization. Call getInstance(dictDir) or initJieba() first."); | ||
| } | ||
| return *JiebaSingleton::instance; | ||
| } | ||
|
|
||
| JiebaSingleton::JiebaSingleton(const char* dictDir) | ||
| : cppjieba::JiebaSegmenter( | ||
| std::string(dictDir), | ||
| std::string(dictDir), | ||
| std::string(dictDir) | ||
| ) | ||
| { | ||
| // base class ctor will load dictionaries/models | ||
| } | ||
|
|
||
| void JiebaSingleton::getWordEndOffsets(const std::string& text, std::vector<int>& wordEndOffsets) { | ||
| std::lock_guard<std::mutex> lock(segMutex); | ||
| wordEndOffsets.clear(); | ||
| std::vector<cppjieba::Word> words; | ||
| this->Cut(text, words, true); | ||
|
|
||
| for (const auto& word : words) { | ||
| wordEndOffsets.push_back(word.unicode_offset + word.unicode_length); | ||
| } | ||
| } | ||
|
|
||
| extern "C" { | ||
|
|
||
| bool initJieba(const char* dictDir) { | ||
| try { | ||
| // simply force the singleton into existence | ||
| (void)JiebaSingleton::getInstance(dictDir); | ||
| return true; | ||
| } catch (...) { | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| bool calculateWordOffsets(const char* text, int** wordEndOffsets, int* outLen) { | ||
| if (!text || !wordEndOffsets || !outLen) return false; | ||
|
|
||
| try { | ||
| std::string textStr(text); | ||
| std::vector<int> offs; | ||
| JiebaSingleton::getInstance().getWordEndOffsets(textStr, offs); | ||
|
|
||
| int n = static_cast<int>(offs.size()); | ||
| if (n == 0) { | ||
| *wordEndOffsets = nullptr; | ||
| *outLen = 0; | ||
| return true; // success, but no offsets | ||
| } | ||
|
|
||
| int* buf = static_cast<int*>(std::malloc(sizeof(int) * n)); | ||
| if (!buf) { | ||
| *wordEndOffsets = nullptr; | ||
| *outLen = 0; | ||
| return false; | ||
| } | ||
| for (int i = 0; i < n; ++i) buf[i] = offs[i]; | ||
| *wordEndOffsets = buf; | ||
| *outLen = n; | ||
| return true; | ||
| } catch (...) { | ||
| *wordEndOffsets = nullptr; | ||
| *outLen = 0; | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| bool insertUserWord(const char* word, int freq, const char* tag = cppjieba::UNKNOWN_TAG) { | ||
| return JiebaSingleton::getInstance().InsertUserWord(string(word), freq, string(tag)); | ||
| } | ||
|
|
||
| bool deleteUserWord(const char* word, const char* tag = cppjieba::UNKNOWN_TAG) { | ||
| return JiebaSingleton::getInstance().DeleteUserWord(string(word), string(tag)); | ||
| } | ||
|
|
||
| bool find(const char* word) { | ||
| return JiebaSingleton::getInstance().Find(string(word)); | ||
| } | ||
|
|
||
| void freeOffsets(int* ptr) { | ||
| if (ptr) std::free(ptr); | ||
| } | ||
|
|
||
| } // extern "C" |
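In the file above, `initJieba` and `calculateWordOffsets` catch every exception, while `insertUserWord`, `deleteUserWord` and `find` call `JiebaSingleton::getInstance()` directly, and that overload throws if the singleton has not been initialised. One way such wrappers could be hardened is sketched below. This is a minimal illustration under stated assumptions, not the code merged in this PR: `guardedCall`, `fakeFind` and `findGuarded` are hypothetical names, and `fakeFind` only stands in for the real segmenter so the block compiles on its own.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of the PR): run `fn` and turn any exception
// into a `false` return, so nothing unwinds through an extern "C" boundary.
template <typename Fn>
bool guardedCall(Fn&& fn) noexcept {
	try {
		return std::forward<Fn>(fn)();
	} catch (...) {
		return false;
	}
}

// Stand-in for JiebaSingleton::getInstance().Find(): it throws when the
// segmenter is not initialised, as the parameterless getInstance() does.
inline bool fakeFind(bool initialised, const std::string& word) {
	if (!initialised) {
		throw std::runtime_error("segmenter not initialised");
	}
	return word == "NVDA";
}

// Shape of a guarded C entry point: reject null input first, then make the
// potentially throwing call inside guardedCall.
extern "C" bool findGuarded(bool initialised, const char* word) {
	if (!word) return false;
	return guardedCall([&] { return fakeFind(initialised, std::string(word)); });
}
```

With this shape, a caller that forgets to initialise the segmenter or passes a null pointer gets `false` back instead of an exception escaping into foreign code.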
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| LIBRARY cppjieba | ||
| EXPORTS | ||
| initJieba | ||
| calculateWordOffsets | ||
| insertUserWord | ||
| deleteUserWord | ||
| find | ||
| freeOffsets |
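The export list pairs `calculateWordOffsets`, which allocates its result with `std::malloc`, with `freeOffsets`, which releases it, so the caller is responsible for handing the buffer back to the same library. A C++ caller could tie the two together with a `std::unique_ptr` and a custom deleter, as in the sketch below. The `stub…` functions are hypothetical stand-ins that only mirror the exported signatures (the real ones live in the DLL), and `wordEndOffsetsFor` and `OffsetsDeleter` are illustrative names, not part of the PR.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <vector>

// Stand-in with the same signature as calculateWordOffsets. It pretends the
// text was segmented into three words ending at offsets 1, 3 and 7.
extern "C" bool stubCalculateWordOffsets(const char* text, int** wordEndOffsets, int* outLen) {
	if (!text || !wordEndOffsets || !outLen) return false;
	const int fake[] = {1, 3, 7};
	const int n = 3;
	int* buf = static_cast<int*>(std::malloc(sizeof(int) * n));
	if (!buf) return false;
	for (int i = 0; i < n; ++i) buf[i] = fake[i];
	*wordEndOffsets = buf;
	*outLen = n;
	return true;
}

// Stand-in with the same signature as freeOffsets.
extern "C" void stubFreeOffsets(int* ptr) {
	std::free(ptr);
}

// Deleter that returns the buffer to the allocator that produced it.
struct OffsetsDeleter {
	void operator()(int* p) const noexcept { stubFreeOffsets(p); }
};

// Copy the offsets into a vector; the unique_ptr releases the C buffer on
// every exit path, including when the vector constructor throws.
std::vector<int> wordEndOffsetsFor(const char* text) {
	int* raw = nullptr;
	int len = 0;
	if (!stubCalculateWordOffsets(text, &raw, &len)) return {};
	std::unique_ptr<int[], OffsetsDeleter> owner(raw);
	return std::vector<int>(raw, raw + len);
}
```

Releasing through the library's own free function matters on Windows, where a DLL and its caller may be linked against different C runtimes.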
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,165 @@ | ||
| /* | ||
| A part of NonVisual Desktop Access (NVDA) | ||
| Copyright (C) 2025 NV Access Limited, Wang Chong | ||
| This file may be used under the terms of the GNU General Public License, version 2 or later, as modified by the NVDA license. | ||
| For full terms and any additional permissions, see the NVDA license file: https://github.com/nvaccess/nvda/blob/master/copying.txt | ||
| */ | ||
|
|
||
| #ifndef CPPJIEBA_DLL_H | ||
| #define CPPJIEBA_DLL_H | ||
| #pragma once | ||
|
|
||
| #include <vector> | ||
| #include <string> | ||
| #include <cstring> | ||
| #include <mutex> | ||
| #include <cstdlib> | ||
| #include <set> | ||
| #include <stdexcept> | ||
| #include "QuerySegment.hpp" | ||
|
|
||
| using namespace std; | ||
|
|
||
| namespace cppjieba { // copied from Jieba.hpp and modified to drop the keyword extractor, which we don't use | ||
|
|
||
| class JiebaSegmenter { | ||
| public: | ||
| JiebaSegmenter(const string& dict_path, | ||
| const string& model_path, | ||
| const string& user_dict_path) | ||
| : dict_trie_(pathJoin(dict_path, "jieba.dict.utf8"), pathJoin(user_dict_path, "user.dict.utf8")), | ||
| model_(pathJoin(model_path, "hmm_model.utf8")), | ||
| mix_seg_(&dict_trie_, &model_) { | ||
| } | ||
| ~JiebaSegmenter() { | ||
| } | ||
|
|
||
| void Cut(const string& sentence, vector<Word>& words, bool hmm = true) const { | ||
| mix_seg_.Cut(sentence, words, hmm); | ||
| } | ||
|
|
||
| bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) { | ||
| return dict_trie_.InsertUserWord(word,freq, tag); | ||
| } | ||
|
|
||
| bool DeleteUserWord(const string& word, const string& tag = UNKNOWN_TAG) { | ||
| return dict_trie_.DeleteUserWord(word, tag); | ||
| } | ||
|
|
||
| bool Find(const string& word) | ||
| { | ||
| return dict_trie_.Find(word); | ||
| } | ||
|
|
||
| void ResetSeparators(const string& s) { | ||
| mix_seg_.ResetSeparators(s); | ||
| } | ||
|
|
||
| const DictTrie* GetDictTrie() const { | ||
| return &dict_trie_; | ||
| } | ||
|
|
||
| const HMMModel* GetHMMModel() const { | ||
| return &model_; | ||
| } | ||
|
|
||
| void LoadUserDict(const vector<string>& buf) { | ||
| dict_trie_.LoadUserDict(buf); | ||
| } | ||
|
|
||
| void LoadUserDict(const set<string>& buf) { | ||
| dict_trie_.LoadUserDict(buf); | ||
| } | ||
|
|
||
| void LoadUserDict(const string& path) { | ||
| dict_trie_.LoadUserDict(path); | ||
| } | ||
|
|
||
| private: | ||
| static string pathJoin(const string& dir, const string& filename) { | ||
| if (dir.empty()) { | ||
| return filename; | ||
| } | ||
|
|
||
| char last_char = dir[dir.length() - 1]; | ||
| if (last_char == '/' || last_char == '\\') { | ||
| return dir + filename; | ||
| } else { | ||
| #ifdef _WIN32 | ||
| return dir + '\\' + filename; | ||
| #else | ||
| return dir + '/' + filename; | ||
| #endif | ||
| } | ||
| } | ||
|
|
||
| static string getCurrentDirectory() { | ||
| string path(__FILE__); | ||
| size_t pos = path.find_last_of("/\\"); | ||
| return (pos == string::npos) ? "" : path.substr(0, pos); | ||
| } | ||
|
|
||
| DictTrie dict_trie_; | ||
| HMMModel model_; | ||
|
|
||
| MixSegment mix_seg_; | ||
| }; // class JiebaSegmenter | ||
|
|
||
| } // namespace cppjieba | ||
|
|
||
|
|
||
| /// @brief Singleton wrapper around cppjieba::JiebaSegmenter. | ||
| class JiebaSingleton : public cppjieba::JiebaSegmenter { | ||
| public: | ||
| /// @brief Returns the single instance, constructing on first call. | ||
| static JiebaSingleton& getInstance(const char* dictDir); | ||
|
|
||
| static JiebaSingleton& getInstance(); | ||
|
|
||
| /// @brief Do thread-safe segmentation and compute word end offsets. | ||
| /// @param text The input text in UTF-8 encoding. | ||
| /// @param wordEndOffsets Output vector to hold the end offset of each word, counted in Unicode characters (not bytes). | ||
| void getWordEndOffsets(const std::string& text, std::vector<int>& wordEndOffsets); | ||
|
|
||
| // singleton bookkeeping | ||
| static JiebaSingleton* instance; | ||
| static std::once_flag initFlag; | ||
|
|
||
| private: | ||
| JiebaSingleton(const char* dictDir); ///< private ctor initializes base JiebaSegmenter | ||
|
|
||
| /// Disable copy and move | ||
| JiebaSingleton(const JiebaSingleton&) = delete; | ||
| JiebaSingleton& operator = (const JiebaSingleton&) = delete; | ||
| JiebaSingleton(JiebaSingleton&&) = delete; | ||
| JiebaSingleton& operator = (JiebaSingleton&&) = delete; | ||
|
|
||
| std::mutex segMutex; ///< guards concurrent Cut() calls | ||
| }; | ||
|
|
||
| #ifdef _WIN32 | ||
| # define JIEBA_API __declspec(dllexport) | ||
| #else | ||
| # define JIEBA_API | ||
| #endif | ||
|
|
||
| extern "C" { | ||
|
|
||
| /// @brief Force singleton construction (load dicts, etc.) before any segmentation. | ||
| JIEBA_API bool initJieba(const char* dictDir); | ||
|
|
||
| /// @brief Segment UTF-8 text into character offsets. | ||
| /// @return true on success, false on failure. | ||
| JIEBA_API bool calculateWordOffsets(const char* text, int** wordEndOffsets, int* outLen); | ||
|
|
||
| /// Wrapper for word management | ||
| JIEBA_API bool insertUserWord(const char* word, int freq, const char* tag); | ||
| JIEBA_API bool deleteUserWord(const char* word, const char* tag); | ||
| JIEBA_API bool find(const char* word); | ||
|
|
||
| /// @brief Free memory allocated by calculateWordOffsets. | ||
| JIEBA_API void freeOffsets(int* ptr); | ||
|
|
||
| } // extern "C" | ||
|
|
||
| #endif // CPPJIEBA_DLL_H | ||
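`getWordEndOffsets` builds each offset from `word.unicode_offset + word.unicode_length`, so the values count characters rather than UTF-8 bytes, and the two diverge as soon as the text contains multi-byte characters. The sketch below illustrates the difference without depending on cppjieba. It assumes valid UTF-8 input and that the segmenter's "unicode" offsets count code points; `countCodePoints`, `EndOffsets` and `endOffsets` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which all have the bit pattern 10xxxxxx. Assumes the input is valid UTF-8.
std::size_t countCodePoints(const std::string& utf8) {
	std::size_t n = 0;
	for (unsigned char c : utf8) {
		if ((c & 0xC0) != 0x80) ++n;
	}
	return n;
}

// End offsets of consecutive words, measured both ways.
struct EndOffsets {
	std::vector<int> codePoints;
	std::vector<int> bytes;
};

// Given already-segmented words, accumulate the running end offset of each
// word in code points and in bytes.
EndOffsets endOffsets(const std::vector<std::string>& words) {
	EndOffsets out;
	int cpEnd = 0;
	int byteEnd = 0;
	for (const std::string& w : words) {
		cpEnd += static_cast<int>(countCodePoints(w));
		byteEnd += static_cast<int>(w.size());
		out.codePoints.push_back(cpEnd);
		out.bytes.push_back(byteEnd);
	}
	return out;
}
```

For a one-character Chinese word followed by a two-character Chinese word and the ASCII word "NVDA", the code point end offsets are 1, 3 and 7, while the byte end offsets are 3, 9 and 13, because each of those Chinese characters takes three bytes in UTF-8.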