Deduplication / Decontamination #174

Open
chschroeder opened this issue Jun 27, 2024 · 0 comments

Hi,

dolma is a wonderful tool, and I'm successfully using it for many steps of my pipeline.

Paragraph-level deduplication works for me. However, when I apply dolma in a similar setting for decontamination, it never assigns any attributes.

What is the problem?

Compared to the "normal" paragraph deduplication, when I only apply an existing bloom filter, the resulting attribute files contain no dedupe attributes. I have already experimented with the desired_false_positive_rate and overlap_threshold parameters, but without any success.

{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL1"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL2"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL3"}

Information about my setup:

I am using the latest dolma 1.0.3 release. My latest minimum working example is based on configs/dolma-v1_5/decontamination.

Here are my config files:

create-bloomfilter.yaml:

documents:
  - benchmarks.jsonl.gz  # these are the files I want to filter with the decontamination step

dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true

bloom_filter:
  read_only: false
  estimated_doc_count: 73543
  #size_in_bytes: 104857  # 100 MB; smaller causes too many FPs
  desired_false_positive_rate: 1e-3  # TODO: 1e-15
  file: decontamination_bloom_filter.bin

processes: 4 
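
For reference, the textbook bloom filter sizing for a given element count and false positive rate can be computed as below. This is a sketch of the standard formulas only; the function name bloom_parameters is hypothetical, and it is an assumption that dolma derives its size from estimated_doc_count and desired_false_positive_rate in a comparable way.

```python
import math


def bloom_parameters(estimated_count, false_positive_rate):
    """Return (size_in_bytes, number_of_hashes) from the standard bloom filter formulas."""
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-estimated_count * math.log(false_positive_rate) / math.log(2) ** 2)
    # Optimal number of hash functions: k = (m / n) * ln 2
    hashes = max(1, round(bits / estimated_count * math.log(2)))
    return math.ceil(bits / 8), hashes


if __name__ == "__main__":
    # Values from create-bloomfilter.yaml above.
    print(bloom_parameters(73543, 1e-3))
```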

decontaminate.yaml:

documents:
  - tmp/v0/documents/*.gz

work_dir:
  input: work/para/input
  output: work/para/output

dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 288347 
  desired_false_positive_rate: 1e-3
  file: decontamination_bloom_filter.bin

processes: 3

Here is the output of dolma -c create-bloomfilter.yaml dedupe:

bloom_filter:
  desired_false_positive_rate: 0.001
  estimated_doc_count: 73543
  file: decontamination_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '

      '
  skip_empty: true
documents:
- benchmarks.jsonl.gz
processes: 4
work_dir:
  input: /tmp/dolma-input-1rmq0gbx
  output: /tmp/dolma-output-ky8van2k
[2024-06-27T12:34:26Z INFO  dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO  dolma::deduper] Skipping "/disk/cschroeder/workspaces/dolma/benchmarks.jsonl.gz" because it already exists
[2024-06-27T12:34:26Z INFO  dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO  dolma::deduper] Bloom filter written.
[2024-06-27T12:34:26Z INFO  dolma::deduper] Done!
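
The log above contains a "Skipping ... because it already exists" line, so it may be worth checking mechanically which inputs were skipped in a run. The following is a minimal sketch that parses log text in the format shown above; the helper name skipped_inputs and the regular expression are my own assumptions about the line layout, not part of dolma.

```python
import re

# Matches lines such as: [2024-06-27T12:34:26Z INFO  dolma::deduper] message
LOG_LINE = re.compile(
    r"^\[(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+(?P<module>[^\]]+)\]\s+(?P<message>.*)$"
)


def skipped_inputs(log_text):
    """Return the quoted paths from all 'Skipping ...' log messages."""
    skipped = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("message").startswith("Skipping "):
            quoted = re.search(r'"([^"]+)"', match.group("message"))
            # Fall back to the full message if no quoted path is present.
            skipped.append(quoted.group(1) if quoted else match.group("message"))
    return skipped
```

Feeding it the captured output of the create-bloomfilter run lists every input file that was not processed.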

And here is the output of dolma -c decontaminate.yaml dedupe:

bloom_filter:
  desired_false_positive_rate: 0.1
  estimated_doc_count: 288347
  file: decontamination_bloom_filter.bin
  read_only: true
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '

      '
  skip_empty: true
documents:
- tmp/v0/documents/*.gz
processes: 3
work_dir:
  input: work/para/input
  output: work/para/output
[2024-06-27T12:38:17Z INFO  dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0000.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0001.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0002.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0003.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0004.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Bloom filter written.
[2024-06-27T12:38:22Z INFO  dolma::deduper] Done!

Am I missing something?
