




Search results for "Datasets": 1 - 40 of 55


There are 55 entries about Datasets. Related tags include machine learning, dataset, and data. Popular entries include "Papers with Code - Machine Learning Datasets".
  • Papers with Code - Machine Learning Datasets

    CIFAR-10 (Canadian Institute for Advanced Research, 10 classes) The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but no

    • litagin/moe-speech · Datasets at Hugging Face

      Not-For-All-Audiences This repository has been marked as containing sensitive content and may contain potentially harmful and sensitive information. View dataset card

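      • Papers With Code's Datasets: A dataset listing site that even shows popularity

        AI and Machine Learning Dataset Dictionary. This article introduces Papers With Code's Datasets, a site where datasets can be found efficiently. Each dataset page shows the tasks the dataset is suited to, the best-performing models, papers with code, data loaders for each library, and the dataset's popularity trend. A very useful new dataset listing site appeared recently, so I would like to introduce it. What is Papers With Code's Datasets? Do you know the site "Papers With Code"? For a range of tasks (for example, image classification or text generation), it presents in ranking form the "machine learning models" that currently achieve the best performance and the "papers with code" that have many stars, and it is free and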

        • GitHub - google-research/deduplicate-text-datasets

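          • Hugging Face's Datasets: A site providing datasets for natural language processing

            The site is in English, but a detailed explanation should not be needed. As a brief overview, popular datasets are listed on the right in order of download count. Besides keyword search, you can filter on the left by [Task Categories] (broad categories of problem type), [Tasks] (more specific problem types), [Languages], [Multilinguality], [Sizes] (data size), and [Licenses]. Deciding which dataset to use is a common worry in machine learning, and this ranking view should be a very useful reference. Contents of each dataset page: clicking a dataset name in Figure 1 (for example, wikitext) opens a page like the one in Figure 2. This is also intuitive to grasp, so a detailed explanation seems unnecessary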

            • GitHub - BandaiNamcoResearchInc/Bandai-Namco-Research-Motiondataset: This repository provides motion datasets collected by Bandai Namco Research Inc

              This repository provides motion datasets collected by Bandai Namco Research Inc. Find here for a README in Japanese. There is a long-standing interest in making diverse stylized motions for games and movies that pursue realistic and expressive character animation; however, creating new movements that include all the various styles of expression using existing methods is difficult. Due to this, Mot

              • Free public datasets for COVID-19 | Google Cloud Blog

                COVID-19 public datasets: supporting organizations in their pandemic response. See how organizations have used the BigQuery COVID-19 public dataset for research, healthcare, and more. By Johanna Katz • 5-minute read. These datasets remove barriers and provide access to critical information quickly and easily, eliminating the need to search for and onboard large data files. Researchers can access the

                • Nerfgun3/bad_prompt · Datasets at Hugging Face

                  Negative Embedding / Textual Inversion Idea The idea behind this embedding was to somehow train the negative prompt as an embedding, thus unifying the basis of the negative prompt into one word or embedding. Side note: Embedding has proven to be very helpful for the generation of hands! :) Usage To use this embedding you have to download the file aswell as drop it into the "\stable-diffusion-webui


                    LAION-5B: A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS by: Romain Beaumont, 31 Mar, 2022. We present a dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, previously the biggest openly accessible image-text dataset in the world - see also our NeurIPS2022 paper. See our update on the LAION-5B dataset. Large image-text models like ALIGN, BASIC, Turing Bletchley, FLO

                    • CC-100: Monolingual Datasets from Web Crawl Data

                      This corpus is an attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises documents separated b

                      • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

                        In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from ra

                        • Overture Maps Foundation Releases General Availability of its Open Maps Datasets – Overture Maps Foundation

                          New data on 200 million+ addresses added in alpha release. SAN FRANCISCO, Calif., July 24, 2024: The Overture Maps Foundation, a collaborative effort to enable current and next-generation interoperable open map services and products, today announced the General Availability (GA) of several of its global open maps datasets, paving the way for new and expanded use cases for a wide range of commercia

                          • kunishou/databricks-dolly-15k-ja · Datasets at Hugging Face

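                            Virgin Australia Airlines began operating on August 31, 2000 as Virgin Blue, with two aircraft on a single route. Virgin Australia Airlines (Virgin Australia Airlines Pty Ltd) is an Australia-based airline and the largest airline by fleet size to carry the Virgin brand. It commenced services on August 31, 2000 as Virgin Blue, with two aircraft on a single route. After the collapse of Ansett Australia in September 2001, it rapidly established its position in the Australian domestic market. It has since grown to serve 32 cities in Australia directly, with hubs in Brisbane, Melbourne, and Sydney.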

                            • elyza/ELYZA-tasks-100 · Datasets at Hugging Face

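                              Input (excerpt): "(Although there are also people who behave outrageously.)" Read the passage above, choose from the options below the one that best fits the blank, and give your reason. Options: だから (therefore), また (also), むしろ (rather), もちろん (of course).
                              Output: The option that fits the blank in the passage is もちろん (of course). Around the blank, the author argues that social order is maintained by voluntary will, and anticipates and raises in advance the counterargument that "there are also people who behave outrageously", so もちろん is appropriate.
                              Evaluation criteria (eval_aspect): the wrong option is chosen: -4 points; the reason is off the mark: -2 points; the reason does not state that a counterargument is being anticipated: -1 point.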

                              • GitHub - finos/perspective: A data visualization and analytics component, especially well-suited for large and/or streaming datasets.

                                • wiki40b  |  TensorFlow Datasets

                                  • izumi-lab/llm-japanese-dataset · Datasets at Hugging Face

                                    Welcome to "abc ~the first~"! Now, the alphabet that begins A, B, C... has how many letters in total?

                                    • hotchpotch/JQaRA · Datasets at Hugging Face


                                      • GitHub - roapi/roapi: Create full-fledged APIs for slowly moving datasets without writing a single line of code.

                                        • Sharing new research, models, and datasets from Meta FAIR

                                          Today, Meta FAIR is publicly releasing several new research artifacts in support of our goal of achieving advanced machine intelligence (AMI) while also supporting open science and reproducibility. The work we’re sharing today includes Meta Segment Anything 2.1 (SAM 2.1), an update to our popular Segment Anything Model 2 for images and video

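                                          • Summary of Japanese datasets usable with huggingface datasets

                                            Legend: Number of entries. Dataset size: the value of dataset.dataset_size, the size after the compressed files are extracted. Download size: the value of dataset.download_size, the size to be downloaded.
                                            cc100: number of entries: 458,387,942; dataset size: 82,042,212,602; download size: 15,916,192,184; streaming: not available.
                                            The original dataset appears to be a text file in which blocks are separated by blank lines, but the datasets library appears to load it as one line per sample. dataset = load_dataset('cc100', lang='ja', split='train')
                                            cc100 sample: {'id': '0', 'text': '午後から雨が心配だったので遠出はせず、『ふれ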
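
The difference between the original block layout and the one-line-per-sample loading described above can be sketched in plain Python. This is only an illustration under stated assumptions: `raw_text` is a tiny hypothetical stand-in for a cc100-style file, `split_into_blocks` and `split_into_lines` are hypothetical helper names, and keeping blank separator lines as samples is an assumption rather than confirmed behavior of the datasets library.

```python
# Hypothetical stand-in for a cc100-style text file:
# blocks of lines separated by blank lines.
raw_text = "line a1\nline a2\n\nline b1\n\nline c1\nline c2\nline c3\n"


def split_into_blocks(text):
    """Group consecutive non-empty lines into blocks, using blank lines as separators."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def split_into_lines(text):
    """Treat every line as its own sample with a string id (assumed layout)."""
    return [{"id": str(i), "text": line} for i, line in enumerate(text.splitlines())]


blocks = split_into_blocks(raw_text)
samples = split_into_lines(raw_text)
```

With per-line samples, anything that needs whole blocks would have to regroup consecutive samples itself, which is the practical consequence of the loading behavior the entry points out.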

                                            • GitHub - huggingface/datasets: 🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

                                              🤗 Datasets is a lightweight library providing two main features: one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("squad"), get any of these datase

                                              • Image recognition of "the moon and a soft-shelled turtle" with Pytorch (using torchvision.datasets.ImageFolder, the counterpart of keras's from_from_directry) - Qiita

                                                • Sharing new research, models, and datasets from Meta FAIR

                                                  For more than a decade, Meta’s Fundamental AI Research (FAIR) team has focused on advancing the state of the art in AI through open research. As innovation in the field continues to move at a rapid pace, we believe that collaboration with the global AI community is more important than ever. Maintaining an open science approach and sharing our work with the community help us stay true to our goal o

                                                  • Hosting your Models and Datasets on Hugging Face Spaces using Streamlit

                                                    Showcase your Datasets and Models using Streamlit on Hugging Face Spaces. Streamlit allows you to visualize datasets and build demos of Machine Learning models in a neat way. In this blog post we will walk you through hosting models and datasets and serving your Streamlit applications in Hugging Face Spaces. Building demos for

                                                    • GitHub - robvanvolt/DALLE-datasets: This is a summary of easily available datasets for generalized DALLE-pytorch training.

                                                      • Datasets

                                                        Opt-out requests submitted by 09.02.2023 were excluded from this version of the dataset, as were initially flagged malicious files (not exhaustive). Datasets and data governance tools released by BigCode. The Stack: Exact deduplicated version of The Stack. The Stack dedup: Near deduplicated version of The Stack (recommended for training). The Stack issues: Collection of GitHub issues. The Stack Metad

                                                        • OpenAssistant/oasst1 · Datasets at Hugging Face

                                                          • GitHub - llm-jp/text2dataset: Easily turn large English text datasets into Japanese text datasets using open LLMs.

                                                            • Well-tuned Simple Nets Excel on Tabular Datasets

                                                              Tabular datasets are the last "unconquered castle" for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures. In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularizati

                                                              • Object Detection Datasets

                                                                Roboflow hosts free public computer vision datasets in many popular formats (including CreateML JSON, COCO JSON, Pascal VOC XML, YOLO v3, and Tensorflow TFRecords). For your convenience, we also have downsized and augmented versions available. If you'd like us to host your dataset, please get in touch.

                                                                • TensorFlow Datasets: The Bad Parts

                                                                  TLDR: TensorFlow’s tf.data API is a popular approach to loading data into deep learning models. Although tf.data has a lot of powerful features, it is built around sequential access to the underlying data set. This design makes it difficult to efficiently shuffle large data sets, to shard data when doing distributed training, and to implement fault-tolerant training. We argue that random access sh
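
To illustrate why sequential access makes shuffling large data sets hard, here is a minimal pure-Python sketch of a fixed-size shuffle buffer, the general technique for shuffling a stream that can only be read in order. It is not tf.data's implementation; `buffer_shuffle`, `buffer_size`, and `seed` are hypothetical names chosen for this example.

```python
import random


def buffer_shuffle(stream, buffer_size, seed=0):
    """Shuffle a sequential stream using only a fixed-size in-memory buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(100), buffer_size=10))
```

Because the k-th output can only be drawn from the first k + buffer_size inputs, a buffer that is small relative to the data set gives only a local shuffle, whereas random access would allow any permutation.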

                                                                  • bigcode/the-stack · Datasets at Hugging Face

                                                                    Terms of Use for The Stack. The Stack dataset is a collection of source code in over 300 programming languages. We ask that you read and acknowledge the following points before using the dataset: The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including a

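                                                                    • Read "A Method to Anonymize Business Metrics to Publishing Implicit Feedback Datasets (Recsys 2020)" - 糞糞糞ネット弁慶

                                                                      Paper: a paper by Gunosy and RIKEN AIP. To publish implicit feedback data held by a company, there are three wishes: to hide business metrics such as the number of active users, revenue, and average click count; to ensure fairness; and to reduce population bias. Here the public dataset is built by sampling users from the log, and the problem is formulated as estimating a sampling weight w for each user. To hide the business metrics, it takes the Wasserstein distance between the post-sampling click-count distribution and a specified distribution (such as Zipf); for fairness, the KL divergence between the post-sampling distribution of user attributes and the uniform distribution; as a countermeasure against population bias, sampli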
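
The fairness term mentioned above, a KL divergence between the sampled users' attribute distribution and the uniform distribution, can be sketched as follows. This is a toy illustration, not the paper's method: the direction KL(p || uniform), the helper names `attribute_distribution` and `kl_to_uniform`, and the example attributes and weights are all assumptions made for this example.

```python
import math
from collections import defaultdict


def attribute_distribution(attributes, weights):
    """Weighted share of each attribute value, using per-user sampling weights."""
    totals = defaultdict(float)
    for attr, w in zip(attributes, weights):
        totals[attr] += w
    norm = sum(totals.values())
    return {attr: total / norm for attr, total in totals.items()}


def kl_to_uniform(dist):
    """KL(p || u), where u is uniform over the attribute values present in p."""
    k = len(dist)
    return sum(p * math.log(p * k) for p in dist.values() if p > 0)


# Hypothetical toy data: one attribute per user and one sampling weight per user.
attributes = ["a", "a", "a", "b"]
equal_weights = [1.0, 1.0, 1.0, 1.0]
rebalanced_weights = [1.0, 1.0, 1.0, 3.0]

skewed_kl = kl_to_uniform(attribute_distribution(attributes, equal_weights))
rebalanced_kl = kl_to_uniform(attribute_distribution(attributes, rebalanced_weights))
```

With equal weights the majority attribute dominates and the divergence is positive; up-weighting the minority user until both attributes have equal mass drives it to zero, which is the kind of effect a fairness term on the weights w would reward.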

                                                                      • PandaSet Open Datasets - Scale

                                                                        • GitHub - NVIDIA-Merlin/NVTabular: NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

                                                                          NVTabular is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte scale datasets and train deep learning (DL) based recommender systems. It provides high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS Dask-cuDF library. NVTabular is a component of NVIDIA Merlin, an open source framework for build

                                                                          • GitHub - langfuse/langfuse: 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

                                                                            • Reindex, Transform, and Aggregate datasets using pandas Library

                                                                              Most of the time, the dataset we get from the business is dirty and cannot be used directly to train machine learning models. Therefore, we must clean the dataset and bring it into the form an algorithm expects as input. This tutorial discusses reindexing, transforming, and aggregating datasets in Pandas. What are Reindexing, Transforming, and Aggregating? Reindexing, transf

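                                                                              • Shota Imai@えるエル on Twitter: "Following Paper With Code, which had already become an infrastructure-like site for the machine learning community, 'Paper with Datasets', which also compiles the datasets used in papers, has apparently been launched (the name is what it is called on Twitter; the site itself is the same)… https://t.co/suec5vimBx"

                                                                                Following Paper With Code, which had already become an infrastructure-like site for the machine learning community, "Paper with Datasets", which also compiles the datasets used in papers, has apparently been launched (the name is what it is called on Twitter; the site itself is the same)… https://t.co/suec5vimBx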

                                                                                • GitHub - spiceai/spiceai: A self-hostable CDN for databases. Spice provides a unified SQL query interface and portable runtime to locally materialize, accelerate, and query datasets across databases, data warehouses, and data lakes.

