Introduce cppjieba as a submodule for Chinese word segmentation #18548
Conversation
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project's existing license, ensuring proper attribution and compliance.
- Update documentation.
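For reference, the resulting `.gitmodules` entry would look roughly like the sketch below; the upstream URL (yanyiwu/cppjieba on GitHub) is assumed to be the official repository referred to above, and only the path and branch come from the bullet list:

```ini
[submodule "third_party/cppjieba"]
	path = third_party/cppjieba
	url = https://github.com/yanyiwu/cppjieba.git
	branch = master
```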
I downloaded the launcher built by GitHub Actions for testing and it doesn't seem to work.
IMO, I suggest that initial development and debugging be carried out locally, especially at the stage when functionality is not yet operational. If extensive testing by early community users is needed, at least the new features should be functional.
Apologies for the insufficient explanation. This PR is intended as an initial step of the overall work. It does not implement any segmentation functionality yet, but merely introduces the `cppjieba` module. Therefore, it has no impact on end users.
As many parts of the code need changes, splitting the work into smaller tasks might be more efficient, as the community can review them in parallel.
@CrazySteve0605 - can you please provide more information on why this library was selected over others? What were the alternative options?
Co-authored-by: Sean Budd <[email protected]>
- Introduce a `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`:
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers' singleton via `getInstance()` with a private constructor
  - Deletes the copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance
- Add a C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer
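Since the PR's testing strategy mentions calling the segmentation function via ctypes, here is a minimal sketch of how the exported C-style API above could be driven from Python. The DLL name, the UTF-8 encoding, and the zero-means-success return convention are assumptions; only the function signatures are taken from the commit description:

```python
import ctypes

# Hypothetical DLL name; the actual nvdaHelper artifact name may differ.
lib = ctypes.cdll.LoadLibrary("cppjieba.dll")

lib.initJieba.restype = ctypes.c_int
lib.segmentOffsets.argtypes = [
    ctypes.c_char_p,
    ctypes.POINTER(ctypes.POINTER(ctypes.c_int)),
    ctypes.POINTER(ctypes.c_int),
]
lib.segmentOffsets.restype = ctypes.c_int
lib.freeOffsets.argtypes = [ctypes.POINTER(ctypes.c_int)]
lib.freeOffsets.restype = None

lib.initJieba()  # force singleton initialization up front

text = "我来到北京清华大学".encode("utf-8")  # encoding is an assumption
offsets = ctypes.POINTER(ctypes.c_int)()
length = ctypes.c_int()
# Assuming a zero return value indicates success:
if lib.segmentOffsets(text, ctypes.byref(offsets), ctypes.byref(length)) == 0:
    try:
        print([offsets[i] for i in range(length.value)])
    finally:
        lib.freeOffsets(offsets)  # release the buffer allocated by the DLL
```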
- Change `submodules` in `jobs → buildNVDA → Build NVDA → Checkout NVDA` from `true` to `recursive`, to ensure cppjieba's own submodule is fetched.
- This will cause sonic's submodule to be fetched as well, which seems currently unused.
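For clarity, the changed checkout step would look roughly like this; the step layout and the actions/checkout version shown are assumptions, and only the `submodules: recursive` value comes from the commit description:

```yaml
- name: Checkout NVDA
  uses: actions/checkout@v4  # version is an assumption
  with:
    submodules: recursive    # was: true (fetches direct submodules only)
```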
Co-authored-by: Sean Budd <[email protected]>
@CrazySteve0605 - could you please update the PR description with what was requested in #18548 (comment), i.e. some more information on the module chosen? Please also make sure to mark the PR as ready for review if it is at this stage. We are hoping to see an implementation of segmentation with an example TextInfo such as …
@seanbudd OK, I see. I'm adding more details to make the main comment more informative. Instead of extending `TextInfo`s, I'm creating a separate `WordSegmentationStrategy` and `WordSegmenter` that follow the strategy pattern, selecting an appropriate segmenter based on the Unicode ranges present in the given text. I think this approach will be more extensible and easier to reuse in other potential components, such as braille output. I also plan to submit it as my next PR.
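For readers unfamiliar with the pattern, a rough sketch of the described shape follows. The class names come from the comment above; the method signatures and the CJK-range selection heuristic are illustrative assumptions, not the actual NVDA implementation:

```python
from abc import ABC, abstractmethod


class WordSegmentationStrategy(ABC):
    @abstractmethod
    def getWordOffsets(self, text: str) -> list[int]:
        """Return the start offsets of the words in ``text``."""


class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str) -> list[int]:
        # Would call into the cppjieba C API (e.g. via ctypes) here.
        raise NotImplementedError


class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str) -> list[int]:
        # Default path for other languages, backed by Uniscribe.
        raise NotImplementedError


class WordSegmenter:
    def _selectStrategy(self, text: str) -> WordSegmentationStrategy:
        # Illustrative heuristic: pick the Chinese strategy when the text
        # contains CJK Unified Ideographs, otherwise fall back to Uniscribe.
        if any("\u4e00" <= ch <= "\u9fff" for ch in text):
            return ChineseWordSegmentationStrategy()
        return UniscribeWordSegmentationStrategy()

    def getWordOffsets(self, text: str) -> list[int]:
        return self._selectStrategy(text).getWordOffsets(text)
```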
This PR is pretty much good. Just two small things in my last review.
Co-authored-by: Sean Budd <[email protected]>
Merged a3a1815 into nvaccess:try-chineseWordSegmentation-staging
…eText

### Link to issue number:
Blocked by #18548; closes #4075; related to #16237; and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Chinese text can be navigated by word via the system caret or review cursor.

### Description of developer facing changes:

### Description of development approach:
- [ ] update `textUtils`
- [x] add `WordSegment` module
  - [x] create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the strategy pattern
  - [x] implement `ChineseWordSegmentationStrategy` (for Chinese text)
  - [x] implement `UniscribeWordSegmentationStrategy` (for other languages, as the default strategy)
- [x] update `textUtils/__init__.py`
  - [x] add `WordSegmenter` class for word segmentation, integrating the segmentation strategies
- [x] update `textInfos/offsets.py`
  - [x] replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
  - [x] integrate `WordSegmenter` for calculating word offsets
- [ ] update documentation

### Testing strategy:

### Known issues with pull request:
Word segmentation functionality was integrated in `OffsetsTextInfo`. In `OffsetsTextInfo`, word segmentation is based on Uniscribe by default, with Unicode rules as a fallback.

Subclasses of `OffsetsTextInfo`:

#### Supported
1. `NVDAObjectTextInfo`
2. `InputCompositionTextInfo`
3. `JABTextInfo`
4. `SimpleResultTextInfo`
5. `VirtualBufferTextInfo`: uses its own function to calculate offsets and invokes its superclass's `_getWordOffsets`

#### Unsupported
1. `DisplayModelTextInfo`: simply disabled
2. `EditTextInfo`: uses Uniscribe directly
3. `ScintillaTextInfo`: entirely uses a self-defined notion of a 'word', for Scintilla-based editors such as Notepad++ (source/NVDAObjects/window/scintilla.py)
4. `VsTextEditPaneTextInfo`: uses a special COM automation API, for Microsoft Visual Studio and Microsoft SQL Server Management Studio (source/appModules/devenv.py)
5. `TextFrameTextInfo`: uses a self-defined 'word' based on PowerPoint's COM object, for PowerPoint's text frames (source/appModules/powerpnt.py)
6. `LwrTextInfo`: based on words pre-computed during the related process, for structured text using `LineWordResult` (e.g. Windows OCR) (source/contentRecog/__init__.py)

#### Partially supported
Some architectures entirely or preferentially use a native definition of a 'word', so software depending on them may not be able to use the new functionality.
1. `IA2TextTextInfo`: overrides and falls back (source/NVDAObjects/IAccessible/__init__.py)

### Code Review Checklist:
- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
Introduce cppjieba, an NLP-based Chinese tokenizer, for implementing Chinese word navigation and braille output.
cppjieba is the C++ version of the popular Python library jieba, offering near-identical runtime performance.
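As a quick illustration of what jieba-style segmentation produces (this sketch uses the Python jieba package, of which cppjieba is the C++ port, and jieba's own demo sentence):

```python
import jieba

# Default (accurate) mode with HMM enabled, matching jieba's README example.
print("/".join(jieba.cut("我来到北京清华大学")))
# 我/来到/北京/清华大学
```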
A detailed integration analysis:
1. Current State of Chinese Segmentation Tools
Chinese segmentation techniques trade accuracy for runtime and memory in predictable ways: pure dictionary- and rule-based segmenters are the lightest and fastest but the least accurate; hybrid approaches that combine dictionary lookup with statistical models such as HMMs (the jieba family) add appreciable accuracy at modest cost; and deep-learning models (e.g. Transformer/BERT-based segmenters) are the most accurate but carry large runtime and memory penalties.
2. NVDA-Specific Requirements
NVDA requires low latency, a small memory footprint, and reliable offline operation, which rules out heavyweight deep-learning models. At the same time, very lightweight pure dictionary- or rule-based segmenters are also unsuitable: their lower accuracy and poor handling of out-of-vocabulary (OOV) words lead to frequent mis-segmentation, inaccurate speech output, and increased user confusion in real-world documents. cppjieba therefore represents a pragmatic middle ground. It combines native C++ performance and modest resource usage with a hybrid strategy (dictionary lookup plus HMM-based sequence labeling) that delivers appreciably better accuracy and OOV mitigation than pure dictionary methods, while remaining far lighter and faster than Transformer/BERT approaches. This makes it a sensible choice for improving Chinese text handling in NVDA without imposing large runtime or memory penalties.
Link to issue number:
Related to #4075 and a part of OSPP 2025 of NVDA.
Summary of the issue:
NVDA's current word navigation mechanism relies on Unicode boundary rules applied through the Uniscribe API, which do not work well for languages such as Chinese that lack explicit word delimiters.
Description of user facing changes:
None
Description of developer facing changes:
A tool to implement word navigation and braille output within Chinese content.
Description of development approach:
Testing strategy:
Confirm whether it can be successfully compiled and whether its segmentation function can be called via ctypes.
Known issues with pull request:
Code Review Checklist: