
Conversation

@CrazySteve0605
Contributor

@CrazySteve0605 CrazySteve0605 commented Jul 24, 2025

Introduce cppjieba, an NLP-based Chinese tokenizer, for implementing Chinese word navigation and braille output.

cppjieba is the C++ version of the popular Python library jieba, offering near-identical runtime performance.

A detailed integration analysis

1. Current State of Chinese Segmentation Tools

Chinese segmentation techniques trade accuracy for runtime and memory in predictable ways:

  1. Dictionary-based (fast, lower OOV coverage). Libraries such as Jieba rely on prefix trees (tries) and greedy algorithms (maximum matching / DAG). These methods are CPU-efficient and lightweight but commonly miss new or domain-specific tokens (OOV words).
  2. Statistical / sequence-labeling (better context, moderate cost). Approaches using HMMs, CRFs or perceptrons (adopted by tools like THULAC, LTP, pkuseg) label characters with B/I/E/S tags to model local context and transitions. They substantially improve OOV handling and achieve high F1 on news/text benchmarks, at the cost of higher computation and memory compared with pure dictionary methods.
  3. Deep learning (highest accuracy, highest cost). Transformer/BERT-based models offer superior contextual understanding and segmentation accuracy, but their inference latency and memory footprint (often hundreds of megabytes to multiple gigabytes) make them unsuitable for real-time, low-latency assistive software like NVDA.
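To make the dictionary-based category concrete, here is a toy forward-maximum-matching segmenter. This is an illustrative simplification only; jieba's actual algorithm builds a DAG over dictionary matches and selects a maximum-probability path, falling back to an HMM for OOV words.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that matches; fall back to a single character.
    The single-character fallback is exactly where OOV words get shredded."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

For example, with `{"北京", "天安门"}` as the dictionary, `"北京天安门"` segments cleanly, while an out-of-vocabulary name degrades to single characters, illustrating the OOV weakness described above.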

2. NVDA-Specific Requirements

NVDA requires low latency, a small memory footprint, and reliable offline operation — which rules out heavyweight deep-learning models. At the same time, very lightweight pure dictionary- or rule-based segmenters are also unsuitable: their lower accuracy and poor OOV handling lead to frequent mis-segmentation, inaccurate speech output, and increased user confusion in real-world documents. cppjieba therefore represents a pragmatic middle ground — native C++ performance and modest resource usage combined with a hybrid strategy (dictionary lookup plus HMM-based sequence labeling) that delivers appreciably better accuracy and OOV mitigation than pure dictionary methods, while remaining far lighter and faster than Transformer/BERT approaches — making it a sensible choice for improving Chinese text handling in NVDA without imposing large runtime or memory penalties.

Link to issue number:

Related to #4075 and part of NVDA's OSPP 2025 project.

Summary of the issue:

NVDA’s current word navigation mechanism relies on Unicode boundary rules through the Uniscribe API, which do not work well for languages such as Chinese due to the absence of explicit word delimiters.

Description of user facing changes:

None

Description of developer facing changes:

A tool for implementing word navigation and braille output for Chinese content.

Description of development approach:

  • Added cppjieba as a submodule.
  • Added its wrapper and building script.

Testing strategy:

Confirm whether the module compiles successfully and whether its segmentation function can be called via ctypes.

Known issues with pull request:

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

@coderabbitai summary

- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide
  robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and
  configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the
  project’s existing license, ensuring proper attribution and compliance.
- Update documentation.
@seanbudd seanbudd requested a review from michaelDCurran July 24, 2025 02:29
@CrazySteve0605 CrazySteve0605 marked this pull request as ready for review July 24, 2025 09:05
@CrazySteve0605 CrazySteve0605 requested a review from a team as a code owner July 24, 2025 09:05
@wmhn1872265132
Contributor

I downloaded the launcher built by GitHub Actions for testing, and it doesn't seem to work.

@cary-rowen
Contributor

IMO, initial development and debugging should be carried out locally, especially while the functionality is not yet operational. If extensive testing by early community users is needed, new features should at least be functional.

@CrazySteve0605
Contributor Author

CrazySteve0605 commented Jul 24, 2025

I downloaded the launcher built by GitHub Actions for testing, and it doesn't seem to work.

Apologies for the insufficient explanation. This PR is intended as an initial step of the overall work. It does not implement any segmentation functionality yet, but merely introduces the `cppjieba` module. Therefore, it has no impact on end users.

@CrazySteve0605
Contributor Author

IMO, initial development and debugging should be carried out locally, especially while the functionality is not yet operational. If extensive testing by early community users is needed, new features should at least be functional.

Since many parts of the code need changes, splitting the work into smaller tasks may be more efficient, as the community can review them in parallel.

@seanbudd
Member

@CrazySteve0605 - can you please provide more information on why this library was selected over others? what were alternative options?

@CrazySteve0605 CrazySteve0605 marked this pull request as draft August 4, 2025 00:32
- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a `.def` file under `nvdaHelper/cppjieba/`
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers’ singleton via `getInstance()` with a private constructor
  - Deletes copy constructor, copy assignment, move constructor, and move assignment to enforce single instance
- Add C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release allocated offset buffer
- Change `submodules` in `jobs - buildNVDA - Build NVDA - Checkout NVDA` from `true` to `recursive` to ensure cppjieba's own submodule is fetched.
- This also causes the sonic submodule to be fetched, which currently appears to be unused.
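The C-style API above could be driven from Python roughly as follows. This is a sketch only: the DLL path, the non-zero-return-on-failure convention, and whether the offsets are character or byte offsets are assumptions, not details confirmed in this PR.

```python
import ctypes

def loadJieba(dllPath: str):
    """Load the cppjieba wrapper DLL and declare the exported C-style API."""
    lib = ctypes.cdll.LoadLibrary(dllPath)
    lib.initJieba.restype = ctypes.c_int
    lib.segmentOffsets.argtypes = [
        ctypes.c_char_p,
        ctypes.POINTER(ctypes.POINTER(ctypes.c_int)),
        ctypes.POINTER(ctypes.c_int),
    ]
    lib.segmentOffsets.restype = ctypes.c_int
    lib.freeOffsets.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.initJieba()  # force singleton initialization up front
    return lib

def segment(lib, text: str) -> list[int]:
    """Call segmentOffsets, copy the offsets into a Python list, free the C buffer."""
    buf = ctypes.POINTER(ctypes.c_int)()
    outLen = ctypes.c_int(0)
    # Assumes a non-zero return value signals failure.
    if lib.segmentOffsets(text.encode("utf-8"), ctypes.byref(buf), ctypes.byref(outLen)) != 0:
        raise RuntimeError("segmentation failed")
    try:
        return [buf[i] for i in range(outLen.value)]
    finally:
        lib.freeOffsets(buf)  # always release the buffer allocated by the DLL

def offsetsToWords(text: str, offsets: list[int]) -> list[str]:
    """Split text at word-start character offsets returned by the segmenter."""
    return [text[s:e] for s, e in zip(offsets, offsets[1:] + [len(text)])]
```

Keeping the buffer release in a `finally` block mirrors the `freeOffsets` contract: the caller, not the DLL, decides when the offset buffer is no longer needed.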
@seanbudd
Member

@CrazySteve0605 - could you please update the PR description with what was requested in #18548 (comment), some more information on the module chosen?

Please also make sure to mark the PR as ready for review if it is at this stage.

We are hoping to see an implementation of segmentation with an example `TextInfo`, such as `WordDocumentTextInfo` or `Gecko_ia2_TextInfo`.

@CrazySteve0605
Contributor Author

@seanbudd OK, I see. I'm adding more details to make the main comment more informative. Instead of extending `TextInfo` classes, I'm creating separate `WordSegmentationStrategy` and `WordSegmenter` classes that follow the strategy pattern, selecting an appropriate segmenter based on the Unicode fields present in the given text. I think this approach will be more extensible and easier to reuse in other potential components, such as braille output. I also plan to submit it as my next PR.
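A minimal sketch of the strategy-pattern design described here, using the class names from this comment. The segmenter bodies are naive placeholders (one word per character for Chinese, whitespace boundaries for the default), not the real cppjieba or Uniscribe calls, and the CJK detection shown is a simplification of whatever Unicode-field check the actual code uses.

```python
from abc import ABC, abstractmethod

class WordSegmentationStrategy(ABC):
    """Each strategy turns text into a list of word-start offsets."""
    @abstractmethod
    def getWordOffsets(self, text: str) -> list[int]:
        ...

class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str) -> list[int]:
        # Placeholder: the real strategy would call the cppjieba wrapper.
        # Here every character starts a new word.
        return list(range(len(text)))

class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str) -> list[int]:
        # Placeholder: naive whitespace boundaries stand in for Uniscribe.
        return [
            i for i, ch in enumerate(text)
            if not ch.isspace() and (i == 0 or text[i - 1].isspace())
        ]

def _containsCJK(text: str) -> bool:
    # Simplified check against the main CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

class WordSegmenter:
    """Selects a segmentation strategy based on the text content."""
    def getWordOffsets(self, text: str) -> list[int]:
        if _containsCJK(text):
            return ChineseWordSegmentationStrategy().getWordOffsets(text)
        return UniscribeWordSegmentationStrategy().getWordOffsets(text)
```

The point of the pattern is that callers such as `TextInfo` subclasses or braille routines depend only on `WordSegmenter`, so new strategies (for other scripts) can be added without touching call sites.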

@CrazySteve0605 CrazySteve0605 marked this pull request as ready for review September 12, 2025 11:10
@michaelDCurran michaelDCurran changed the base branch from master to try-chineseWordSegmentation-staging September 17, 2025 01:33
@michaelDCurran
Member

This PR is pretty much good; just two small things in my last review.
We have decided that this and any other related PRs for this OSPP project will be merged into a special try-wordSegmentation-staging branch rather than master, so that all PRs can be reviewed and merged into one working branch which, if successful, will then be merged to master all at once. I have therefore switched the base branch of this PR to try-wordSegmentation-staging accordingly.

@CrazySteve0605 CrazySteve0605 requested a review from a team as a code owner September 20, 2025 16:18
@CrazySteve0605 CrazySteve0605 requested review from Qchristensen and removed request for a team September 20, 2025 16:18
@CrazySteve0605 CrazySteve0605 force-pushed the integrateCPPJieba branch 2 times, most recently from 5562e70 to fec70a9 Compare September 26, 2025 12:49
@michaelDCurran michaelDCurran merged commit a3a1815 into nvaccess:try-chineseWordSegmentation-staging Sep 29, 2025
38 checks passed
@github-actions github-actions bot added this to the 2026.1 milestone Sep 29, 2025
@michaelDCurran
Member

Please note, that I incorrectly had merged this as a squash commit, which caused problems for dependent PRs such as #18735.
I have now reverted that squash commit and merged this PR's branch unsquashed as commit 093a825

michaelDCurran added a commit that referenced this pull request Oct 28, 2025
…eText

### Link to issue number:
blocked by #18548, closes #4075, related to #16237, and [part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Chinese text can be navigated by word via system caret or review cursor.
### Description of developer facing changes:

### Description of development approach:
- [ ] update `textUtils`
	- [x] add `WordSegment` module
		- [x] create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the strategy pattern
		- [x] implement `ChineseWordSegmentationStrategy` (for Chinese text)
		- [x] implement `UniscribeWordSegmentationStrategy` (for other languages as default strategy)
	- [x] update `textUtils/__init__.py`
		- [x] add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- [x] update `textInfos/offsets.py`
	- [x] replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
	- [x] integrate `WordSegmenter` for calculating word offsets
- [ ] update document

### Testing strategy:

### Known issues with pull request:
Word segmentation functionality was integrated into `OffsetsTextInfo`, where segmentation uses Uniscribe by default, with Unicode rules as a fallback.

Subclasses of OffsetsTextInfo

#### Supported
1. `NVDAObjectTextInfo`
2. `InputCompositionTextInfo`
3. `JABTextInfo`
4. `SimpleResultTextInfo`
5. `VirtualBufferTextInfo`: uses its own function to calculate offsets and invokes its superclass's `_getWordOffsets`

#### Unsupported
1. `DisplayModelTextInfo`: disabled outright
2. `EditTextInfo`: uses Uniscribe directly
3. `ScintillaTextInfo`: uses an entirely self-defined notion of 'word', for Scintilla-based editors such as Notepad++ (source/NVDAObjects/window/scintilla.py)
4. `VsTextEditPaneTextInfo`: uses a special COM automation API, for Microsoft Visual Studio and Microsoft SQL Server Management Studio (source/appModules/devenv.py)
5. `TextFrameTextInfo`: uses a self-defined notion of 'word' based on PowerPoint's COM object, for PowerPoint's text frames (source/appModules/powerpnt.py)
6. `LwrTextInfo`: based on words pre-computed during the related process, for structured text using `LineWordResult` (e.g. Windows OCR) (source/contentRecog/__init__.py)

#### Partially Supported
Some architectures rely wholly or primarily on a native definition of a 'word', so software depending on them may not be able to use the new functionality.
1. `IA2TextTextInfo`: overrides and falls back (source/NVDAObjects/IAccessible/\_\_init\_\_.py)

### Code Review Checklist:

<!--
This checklist is a reminder of things commonly forgotten in a new PR.
Authors, please do a self-review of this pull-request.
Check items to confirm you have thought about the relevance of the item.
Where items are missing (eg unit / system tests), please explain in the PR.
To check an item `- [ ]` becomes `- [x]`, note spacing.
You can also check the checkboxes after the PR is created.
A detailed explanation of this checklist is available here:
https://github.com/nvaccess/nvda/blob/master/projectDocs/dev/githubPullRequestTemplateExplanationAndExamples.md#code-review-checklist
-->

- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.

Labels

conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review.
