[B! dataset] fuba's bookmarks

fuba id:fuba

fuba's bookmarks about dataset (19)

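Large-scale image datasets - n_hidekey's diary
Datasets used for image recognition and retrieval have been growing larger recently. Here is a roundup of some representative ones and some I found recently. (The rough threshold here is at least 100,000 images for labeled data and at least 1,000,000 images for unlabeled data.) ImageNet http://www.image-net.org/ A collection of images gathered for each word (nouns only so far), following the ontology of WordNet, which is well known in natural language processing. It uses Amazon Mechanical Turk and is designed to produce a high-quality dataset. Data is accumulated and updated daily, and as of January 2012 it appears to hold about 14 million images (22,000 categories). Annotation is basically one category per image, and some images also come with bounding boxes showing object locations. For some categories there are not enough images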
fuba 2012/01/16
dataset

cv
Link
http://buzzdata.com/
fuba 2011/09/21
A GitHub for datasets?

dataset
Link
Tweets2011 Twitter Collection
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included. The Tweets2011 corpus is unusual in that what you get is a list of tweet identifiers, and the actual twe
fuba 2011/09/02
The corpus used in the TREC 2011 microblog track. They only give you the IDs, so you have to crawl the tweets yourself. Sounds tedious, but maybe I should build it anyway…

twitter

corpus

dataset
Link
http://theinfo.org/get/data
fuba 2011/08/01
dataset
Link
Roundup of sites where you can get large-scale data for free - nokuno's diary
id:makimoto asked on Quora, in the question below, about sites that publish large-scale data. Data: Where can I get large datasets open to the public? - Quora Below is a list of the sites mentioned. Some of them appear to be paid. UCI Machine Learning Repository, Public Data Sets : Amazon Web Services, CRAWDAD, no title, City of Chicago | Data Portal, GovLoop | Social Data Network for Government, data.gov.uk | Opening up government, Data.Medicare.Gov, Data.Seattle.Gov | Seattle’s Data Site, Op
fuba 2011/06/16
dataset
Link
AWS Public Data Set
fuba 2011/06/15
dataset

log
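Link
livedoor Tech Blog : livedoor Gourmet DataSet released
This is Kushii. We previously released livedoor clip data for academic research, and thanks to that we now occasionally see the name livedoor clip cited in presentations and the like, which is deeply gratifying. Now, as the second installment, we are making the livedoor Gourmet data available to download and use in bulk. This time we were short on resources in various ways, so we could not prepare a page with fancy illustrations. Unlike the livedoor clip data, it will not be updated periodically; it contains only the data as of April 22, 2011 ... but we hope it will be of some use to your research. Likely questions and answers: I am a livedoor Gourmet user. Does this mean my personal information will be made public? That would be a problem! What is released is only information that anyone could already see on the livedoor Gourmet site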
fuba 2011/05/20
dataset
Link
Software Development Company | DEV
The best custom software development companies include the best UX design. The best UX starts with strategic planning. By aligning our digital transformation solutions with your vision and goals, we become a true partner, starting at UX/UI design. The best software development firms start their web development projects at the design phase. Whether you are looking for mobile app development or web
fuba 2011/05/12
Sharing and selling data obtained by scraping and crawling

dataset
Link
Radiation monitoring data summary page (top)
fuba 2011/03/30
dataset

radiation

disaster
Link
Email Datasets: person name disambiguation and threading
1. Personal Name Annotation Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find a couple of email datasets, as well as a dataset of news groups text - annotated with personal names spans. The full description of these datasets, including relevant statistics and references, is available in: Einat Minkov, Richard C. Wang & William W. Cohen, Extr
fuba 2011/01/03
dataset

nlp

wsd
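Link
Japanese-language Twitter collected by yats, September (1/2) - 不可視点
I'd like to release a dump of Japanese-language Twitter for the first time in a while. It is a MySQL dump (the kind you import with load data) of 177 million tweets from 9/1 to 9/16. yats collects tweets posted from public accounts, specifically those from users who tweeted within the past three weeks and those from Japanese-language users with a cumulative total of 200 to 400 tweets. It is best effort. It also records most of the tweets that come through the streaming api. Schema: CREATE TABLE `buffer_20100916` ( `id_autoinc` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `id` bigint(20) unsigned NOT NULL, `user` varchar(20) NOT NULL, `content` text NOT NUL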
fuba 2010/09/19
dataset

twitter
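Link
N-gram Corpus - Japanese Web Corpus 2010
Overview: A corpus of morpheme N-grams and character N-grams that occur in web pages, recorded together with their frequencies. Each N-gram corpus contains 1-grams through 7-grams with a frequency of at least 10/100/1000. The N-gram corpora were built with the same preprocessing as the Google N-gram corpus. Because full stops, exclamation marks, and question marks are used as sentence delimiters, proper nouns such as 「モーニング娘。」 and 「Yahoo!」 are split into sentences inappropriately. In addition, sentence delimiters are removed, so full stops, exclamation marks, and question marks do not appear in the corpus. Both the morpheme N-gram corpus and the character N-gram corpus use sentence boundary marks (<S>, </S>) but do not use an unknown-word token (<UNK>). Also, the character N-gram corpus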
fuba 2010/09/15
corpus

nlp

dataset
Link
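The preprocessing described in the N-gram corpus entry above (split on sentence-ending punctuation, drop the delimiters, add boundary marks, keep only N-grams above a frequency threshold) can be sketched roughly as follows. This is a minimal illustration for character N-grams only, not the corpus authors' actual pipeline: the function names, the choice to accept both full-width and ASCII punctuation, and the treatment of `<S>` and `</S>` as single tokens are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Sentence delimiters named in the excerpt: full stop, exclamation mark,
# question mark. Accepting both full-width and ASCII forms is an assumption.
DELIMITERS = re.compile(r"[。！？!?]")


def split_sentences(text):
    """Split on delimiters and drop them, so they never appear in an N-gram."""
    return [s for s in DELIMITERS.split(text) if s]


def char_ngrams(sentence, n):
    """Yield character N-grams, with <S> and </S> treated as single tokens."""
    tokens = ["<S>"] + list(sentence) + ["</S>"]
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(text, n, min_freq=1):
    """Count N-grams over all sentences and keep those seen >= min_freq times."""
    counts = Counter()
    for sentence in split_sentences(text):
        counts.update(char_ngrams(sentence, n))
    return {gram: c for gram, c in counts.items() if c >= min_freq}


if __name__ == "__main__":
    # The proper noun 「モーニング娘。」 is cut at its full stop, which is the
    # inappropriate sentence break the excerpt mentions.
    print(split_sentences("モーニング娘。のライブ"))
    print(count_ngrams("ああ。ああ", 2))
```

Because there is no unknown-word token in this scheme, rare N-grams are simply dropped by the `min_freq` filter rather than mapped to `<UNK>`.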
Index of /misc/danbooru/
../ README.txt 17-Sep-2012 18:24 1306 posts.json.gz 17-Sep-2012 18:24 71561413 tags.json.gz 17-Sep-2012 18:24 1292525
fuba 2010/09/08
dataset
Link
CLEANEVAL
CLEANEVAL: home page CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. The first Cleaneval took place (for Chinese and English) over the summer of 2007, with a workshop in Belgium in September (3rd Web as Corpus workshop (WAC3),
fuba 2010/06/30
contentextraction

web

mining

dataset
Link
What is Twitter, a Social Network or a News Media? - WWW'10
ANLAB Traces What is Twitter, a Social Network or a News Media? (WWW2010) Mining communities in networks: a solution for consistency and its evaluation (IMC2009) Consistent Community Identification in Very Large Networks (MICNET2009) Evaluation of VoIP Quality over WiBro (PAM2008) I Tube, You Tube, Everybody Tubes (IMC2007) Analysis of Topological Characteristics of Huge Online Social Networking S
fuba 2010/03/11
twitter

dataset
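Link
animeface-character-dataset - デー
animeface-character-dataset A dataset I use that groups thumbnails (160x160) of face regions from illustrations by character. It is also used in the anime face similarity search, to learn the function that converts face-region images into feature vectors. (What is displayed is different data.) There are thumbnails like these for 150 to 200 characters. Please think of it as something put together by volunteers from 2channel on their own.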
fuba 2010/02/14
cv

animeface

dataset
Link
animeface-character-dataset
A dataset that groups thumbnails (160x160) of face regions from anime, game, and other illustrations by character.
fuba 2010/02/14
cv

animeface

dataset
Link
http://twitter.com/penguinana/status/1183139212
fuba 2009/02/06
Twitter social graph API dump, Japanese users only

twitter

dataset

socialgraph
Link
MNIST handwritten digit database, Yann LeCun and Corinna Cortes
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spendin
fuba 2008/12/26
ml

dataset
Link