Introduce cppjieba as a submodule for Chinese word segmentation #18548
Conversation
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project's existing license, ensuring proper attribution and compliance.
- Update documentation.
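For reference, the resulting `.gitmodules` entry would look roughly like the sketch below; the upstream URL (yanyiwu/cppjieba on GitHub) is assumed to be the official repository referred to above, and only the path and branch come from the bullet list:

```ini
[submodule "third_party/cppjieba"]
	path = third_party/cppjieba
	url = https://github.com/yanyiwu/cppjieba.git
	branch = master
```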
I downloaded the launcher built by GitHub Actions for testing and it doesn't seem to work.
IMO, I suggest that initial development and debugging be carried out locally, especially at the stage when functionality is not yet operational. If extensive testing by early community users is needed, at least the new features should be functional.
Apologies for the insufficient explanation. This PR is intended as an initial step of the overall work. It does not implement any segmentation functionality yet, but merely introduces the `cppjieba` module. Therefore, it has no impact on end users.
As many parts of the code need changes, splitting the work into smaller tasks might be more efficient, as the community can review them in parallel.
@CrazySteve0605 - can you please provide more information on why this library was selected over others? What were the alternative options?
Co-authored-by: Sean Budd <[email protected]>
- Introduce a `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`:
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers' singleton via `getInstance()` with a private constructor
  - Deletes the copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance
- Add a C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer
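Since the PR's testing strategy mentions calling the segmentation function via ctypes, here is a minimal sketch of how the exported C-style API above could be driven from Python. The DLL name, the UTF-8 encoding, and the zero-means-success return convention are assumptions; only the function signatures are taken from the commit description:

```python
import ctypes

# Hypothetical DLL name; the actual nvdaHelper artifact name may differ.
lib = ctypes.cdll.LoadLibrary("cppjieba.dll")

lib.initJieba.restype = ctypes.c_int
lib.segmentOffsets.argtypes = [
    ctypes.c_char_p,
    ctypes.POINTER(ctypes.POINTER(ctypes.c_int)),
    ctypes.POINTER(ctypes.c_int),
]
lib.segmentOffsets.restype = ctypes.c_int
lib.freeOffsets.argtypes = [ctypes.POINTER(ctypes.c_int)]
lib.freeOffsets.restype = None

lib.initJieba()  # force singleton initialization up front

text = "我来到北京清华大学".encode("utf-8")  # encoding is an assumption
offsets = ctypes.POINTER(ctypes.c_int)()
length = ctypes.c_int()
# Assuming a zero return value indicates success:
if lib.segmentOffsets(text, ctypes.byref(offsets), ctypes.byref(length)) == 0:
    try:
        print([offsets[i] for i in range(length.value)])
    finally:
        lib.freeOffsets(offsets)  # release the buffer allocated by the DLL
```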
- Change `submodules` in `jobs → buildNVDA → Build NVDA → Checkout NVDA` from `true` to `recursive`, to ensure cppjieba's own submodule is fetched.
- This will cause sonic's submodule to be fetched as well, which seems currently unused.
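For clarity, the changed checkout step would look roughly like this; the step layout and the actions/checkout version shown are assumptions, and only the `submodules: recursive` value comes from the commit description:

```yaml
- name: Checkout NVDA
  uses: actions/checkout@v4  # version is an assumption
  with:
    submodules: recursive    # was: true (fetches direct submodules only)
```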
Co-authored-by: Sean Budd <[email protected]>
@CrazySteve0605 - could you please update the PR description with what was requested in #18548 (comment), i.e. some more information on the module chosen? Please also make sure to mark the PR as ready for review if it is at this stage. We are hoping to see an implementation of segmentation with an example TextInfo such as …
@seanbudd OK, I see. I'm adding more details to make the main comment more informative. Instead of extending `TextInfo`s, I'm creating a separate `WordSegmentationStrategy` and `WordSegmenter` that follow the strategy pattern, selecting an appropriate segmenter based on the Unicode ranges present in the given text. I think this approach will be more extensible and easier to reuse in other potential components, such as braille output. I also plan to submit it as my next PR.
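For readers unfamiliar with the pattern, a rough sketch of the described shape follows. The class names come from the comment above; the method signatures and the CJK-range selection heuristic are illustrative assumptions, not the actual NVDA implementation:

```python
from abc import ABC, abstractmethod


class WordSegmentationStrategy(ABC):
    @abstractmethod
    def getWordOffsets(self, text: str) -> list[int]:
        """Return the start offsets of the words in ``text``."""


class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str) -> list[int]:
        # Would call into the cppjieba C API (e.g. via ctypes) here.
        raise NotImplementedError


class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str) -> list[int]:
        # Default path for other languages, backed by Uniscribe.
        raise NotImplementedError


class WordSegmenter:
    def _selectStrategy(self, text: str) -> WordSegmentationStrategy:
        # Illustrative heuristic: pick the Chinese strategy when the text
        # contains CJK Unified Ideographs, otherwise fall back to Uniscribe.
        if any("\u4e00" <= ch <= "\u9fff" for ch in text):
            return ChineseWordSegmentationStrategy()
        return UniscribeWordSegmentationStrategy()

    def getWordOffsets(self, text: str) -> list[int]:
        return self._selectStrategy(text).getWordOffsets(text)
```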
This PR is pretty much good. Just two small things in my last review.
Co-authored-by: Sean Budd <[email protected]>
Merged a3a1815 into nvaccess:try-chineseWordSegmentation-staging
…eText

### Link to issue number:
Blocked by #18548; closes #4075; related to #16237; and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Chinese text can be navigated by word via the system caret or review cursor.

### Description of developer facing changes:

### Description of development approach:
- [ ] update `textUtils`
- [x] add `WordSegment` module
  - [x] create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the strategy pattern
  - [x] implement `ChineseWordSegmentationStrategy` (for Chinese text)
  - [x] implement `UniscribeWordSegmentationStrategy` (for other languages, as the default strategy)
- [x] update `textUtils/__init__.py`
  - [x] add `WordSegmenter` class for word segmentation, integrating the segmentation strategies
- [x] update `textInfos/offsets.py`
  - [x] replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
  - [x] integrate `WordSegmenter` for calculating word offsets
- [ ] update documentation

### Testing strategy:

### Known issues with pull request:
Word segmentation functionality was integrated in `OffsetsTextInfo`. In `OffsetsTextInfo`, word segmentation is based on Uniscribe by default, with Unicode rules as a fallback.

Subclasses of `OffsetsTextInfo`:

#### Supported
1. `NVDAObjectTextInfo`
2. `InputCompositionTextInfo`
3. `JABTextInfo`
4. `SimpleResultTextInfo`
5. `VirtualBufferTextInfo`: uses its own function to calculate offsets and invokes its superclass's `_getWordOffsets`

#### Unsupported
1. `DisplayModelTextInfo`: simply disabled
2. `EditTextInfo`: uses Uniscribe directly
3. `ScintillaTextInfo`: entirely uses a self-defined notion of a 'word', for Scintilla-based editors such as Notepad++ (source/NVDAObjects/window/scintilla.py)
4. `VsTextEditPaneTextInfo`: uses a special COM automation API, for Microsoft Visual Studio and Microsoft SQL Server Management Studio (source/appModules/devenv.py)
5. `TextFrameTextInfo`: uses a self-defined 'word' based on PowerPoint's COM object, for PowerPoint's text frames (source/appModules/powerpnt.py)
6. `LwrTextInfo`: based on words pre-computed during the related process, for structured text using `LineWordResult` (e.g. Windows OCR) (source/contentRecog/__init__.py)

#### Partially supported
Some architectures entirely or preferentially use a native definition of a 'word', so software depending on them may not be able to use the new functionality.
1. `IA2TextTextInfo`: overrides and falls back (source/NVDAObjects/IAccessible/__init__.py)

### Code Review Checklist:
- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
Introduce cppjieba, an NLP-based Chinese tokenizer, for implementing Chinese word navigation and braille output.
cppjieba is the C++ version of the popular Python library jieba, offering near-identical runtime performance.
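As a quick illustration of what jieba-style segmentation produces (this sketch uses the Python jieba package, of which cppjieba is the C++ port, and jieba's own demo sentence):

```python
import jieba

# Default (accurate) mode with HMM enabled, matching jieba's README example.
print("/".join(jieba.cut("我来到北京清华大学")))
# 我/来到/北京/清华大学
```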
A detailed integration analysis:
1. Current State of Chinese Segmentation Tools
Chinese segmentation techniques trade accuracy for runtime and memory in predictable ways: pure dictionary- and rule-based segmenters are the lightest and fastest but the least accurate; hybrid approaches that combine dictionary lookup with statistical models such as HMMs (the jieba family) add appreciable accuracy at modest cost; and deep-learning models (e.g. Transformer/BERT-based segmenters) are the most accurate but carry large runtime and memory penalties.
2. NVDA-Specific Requirements
NVDA requires low latency, a small memory footprint, and reliable offline operation, which rules out heavyweight deep-learning models. At the same time, very lightweight pure dictionary- or rule-based segmenters are also unsuitable: their lower accuracy and poor handling of out-of-vocabulary (OOV) words lead to frequent mis-segmentation, inaccurate speech output, and increased user confusion in real-world documents. cppjieba therefore represents a pragmatic middle ground. It combines native C++ performance and modest resource usage with a hybrid strategy (dictionary lookup plus HMM-based sequence labeling) that delivers appreciably better accuracy and OOV mitigation than pure dictionary methods, while remaining far lighter and faster than Transformer/BERT approaches. This makes it a sensible choice for improving Chinese text handling in NVDA without imposing large runtime or memory penalties.
Link to issue number:
Related to #4075 and a part of OSPP 2025 of NVDA.
Summary of the issue:
NVDA's current word navigation mechanism relies on Unicode boundary rules applied through the Uniscribe API, which do not work well for languages such as Chinese that lack explicit word delimiters.
Description of user facing changes:
None
Description of developer facing changes:
A tool to implement word navigation and braille output within Chinese content.
Description of development approach:
Testing strategy:
Confirm whether it can be successfully compiled and whether its segmentation function can be called via ctypes.
Known issues with pull request:
Code Review Checklist: