BertChunker: Efficient and Trained Chunking for Unstructured Documents

Model | Paper

BertChunker is a text chunker built on BERT, with a classifier head that predicts the start token of each chunk (for use in RAG, etc.). It is fine-tuned from sentence-transformers/all-MiniLM-L6-v2; the whole training run took 10 minutes on an NVIDIA P40 GPU using a 50 MB synthesized dataset. This repo includes code for defining the model, generating the dataset, training, and testing.
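To illustrate the idea of predicting chunk start tokens, here is a minimal sketch of how per-token predictions could be turned into chunks. It is not the code in this repo: the function name `chunk_by_start_tokens`, the `threshold` parameter, and the example probabilities are all assumptions made for illustration, and the probabilities are hard-coded rather than produced by the model.

```python
from typing import List, Sequence


def chunk_by_start_tokens(
    tokens: Sequence[str],
    start_probs: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the repo
) -> List[List[str]]:
    """Group tokens into chunks, opening a new chunk at each predicted start token."""
    if len(tokens) != len(start_probs):
        raise ValueError("tokens and start_probs must have the same length")

    chunks: List[List[str]] = []
    for token, prob in zip(tokens, start_probs):
        # The first token always opens a chunk; later tokens open one
        # only when their start probability reaches the threshold.
        if not chunks or prob >= threshold:
            chunks.append([token])
        else:
            chunks[-1].append(token)
    return chunks


# Made-up probabilities standing in for a classifier head's output.
tokens = ["Cats", "purr", ".", "Stocks", "fell", "today", "."]
start_probs = [0.9, 0.1, 0.05, 0.8, 0.1, 0.2, 0.1]

for chunk in chunk_by_start_tokens(tokens, start_probs):
    print(" ".join(chunk))
```

In this sketch the chunk boundaries depend only on where the start probability crosses the threshold, so raising the threshold yields fewer, longer chunks. See test.py for how the actual model is run.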

Generate dataset

See generate_dataset.ipynb

Train from the base model all-MiniLM-L6-v2

Run

bash train.sh

Inference

See test.py

Citation

If this work is helpful, please cite it as:

@article{BertChunker,
  title={BertChunker: Efficient and Trained Chunking for Unstructured Documents}, 
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/BertChunker}
}
