Welcome to NLP-Twitter 👋

License: MIT

Twitter_spider for China.


👤 h4m5t

About The Project

This project introduces several ways to crawl Tweets so that students in China can use them for scientific research or course projects.





❌ Data input and output storage (CSV, SQL)

❌ Multithreaded crawling


  • 1. Use third-party Twitter scraping libraries

    • https://github.com/twintproject/twint

    • https://github.com/bisguzar/twitter-scraper

    • https://github.com/jonbakerfish/TweetScraper


      WARNING:root:Error retrieving https://twitter.com/: Timeout(ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x0000023F9CCDFF10>, 'Connection to twitter.com timed out. (connect timeout=10)')), retrying


      • Use a VPN (ExpressVPN / NordVPN) (drawback: relatively expensive)






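      • Set a system-wide proxy on the local machine (tried, not really workable)

      • Set a proxy in the IDE (tried, not really workable)

      • Set the proxy in twint's config (tried, not really workable)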


        config.Proxy_host = ''

        config.Proxy_port = 7890

        config.Proxy_type = "socks5"

      • Run the script on an overseas VPS (works)

        • digitalocean

        • vultr

        • hostwinds

        • Linode

        • BandwagonHost (搬瓦工)

  • 2. Use the Twitter developer API

    The Twitter API enables programmatic access to Twitter in unique and advanced ways. Use it to analyze, learn from, and interact with Tweets, Direct Messages, users, and other key Twitter resources.


  • 3. Use general-purpose third-party crawling libraries

    • Scrapy
    • requests
    • urllib
    • BeautifulSoup
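  • 4. Use a data collection tool

  • 5. Simulate browser operations with selenium


    • Install chromedriver

      Open Chrome, enter chrome://version in the address bar to check the browser version, then install the matching version of chromedriver.

    • When scripting scrolling, pagination, and similar actions, add appropriate delays.

    • Twitter applies different restriction policies to different IPs: in some regions you must log in to see tweets, in others you do not.

    • If you need to load existing browser data, close Chrome before starting webdriver so that user_data is not locked by the running browser.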
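
The advice above about adding delays while scrolling can be sketched as follows. This is a minimal illustration, not the project's own code: `scroll_with_delay` and its parameters are hypothetical names, and the function only assumes a driver object that exposes Selenium's `execute_script` method. It scrolls to the bottom of the page, waits, and stops once the page height no longer grows.

```python
import random
import time


def scroll_with_delay(driver, max_scrolls=10, pause=2.0, jitter=1.0):
    """Scroll to the bottom of the page repeatedly, pausing between scrolls.

    `driver` is assumed to be a Selenium WebDriver (any object with an
    `execute_script` method works). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give new tweets time to load; the random jitter makes the timing less regular.
        time.sleep(pause + random.uniform(0, jitter))
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return scrolls


# Usage with a real browser (requires selenium and a matching chromedriver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://twitter.com/NASA")
#   scroll_with_delay(driver)
```

Suitable values for `pause` and `jitter` depend on network speed and on how the page loads, so they are best found by trial.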


  • requests
  • selenium
  • twint
  • csv
  • time
  • datetime
  • urllib


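  1. generate_url.py generates the URL for each user ID and saves the URLs in url.txt

  2. test*.py test the crawling libraries, proxy settings, and simulated browser operations

  3. Twitter.csv contains information about 100 China-related individuals

  4. user_info.py crawls each user's follower and following counts

  5. user_tweets.py crawls the tweets of the corresponding users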
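
As an illustration of the first step, the sketch below turns a list of user IDs into profile URLs and writes them to url.txt. It is not the actual generate_url.py: the function names, the assumption that each ID is a Twitter screen name, and the `https://twitter.com/<name>` URL pattern are all assumptions made for this example.

```python
from pathlib import Path

BASE_URL = "https://twitter.com/"  # assumed URL pattern for a profile page


def build_urls(user_ids):
    """Return one profile URL per non-empty user ID (assumed to be a screen name)."""
    urls = []
    for user_id in user_ids:
        name = user_id.strip().lstrip("@")  # tolerate padding and a leading "@"
        if name:
            urls.append(BASE_URL + name)
    return urls


def save_urls(urls, path="url.txt"):
    """Write the URLs to `path`, one per line."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")
```

A call such as `save_urls(build_urls(["NASA", "TwitterDev"]))` would then produce a url.txt with two lines.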


Building standard queries

The best way to build a standard query and test if it’s valid and will return matched Tweets is to first try it at twitter.com/search. As you get a satisfactory result set, the URL loaded in the browser will contain the proper query syntax that can be reused in the standard search API endpoint. Here’s an example:

  1. We want to search for Tweets referencing @TwitterDev account. First, we run the search on twitter.com/search
  2. Check and copy the URL loaded. In this case, we got: https://twitter.com/search?q=%40twitterdev
  3. Replace https://twitter.com/search with https://api.twitter.com/1.1/search/tweets.json and you will get: https://api.twitter.com/1.1/search/tweets.json?q=%40twitterdev
  4. Run a Twurl command to execute the search.
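
Step 3 amounts to swapping the host and path while keeping the query string. A minimal sketch using only the Python standard library follows; the function name `to_api_url` is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

API_HOST = "api.twitter.com"
API_PATH = "/1.1/search/tweets.json"


def to_api_url(search_url):
    """Convert a twitter.com/search URL into the standard search API URL."""
    parts = urlsplit(search_url)
    if parts.netloc not in ("twitter.com", "www.twitter.com") or parts.path != "/search":
        raise ValueError("expected a https://twitter.com/search URL")
    # Keep the already-encoded query string untouched; drop any fragment.
    return urlunsplit(("https", API_HOST, API_PATH, parts.query, ""))


print(to_api_url("https://twitter.com/search?q=%40twitterdev"))
```

URLs copied from the browser sometimes carry extra parameters besides `q`; this sketch passes them through unchanged, so they may need to be removed by hand.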

Please note that the API requires that the request be authenticated (check the Authentication & Authorization documentation for more details on this). Note that the standard search API only serves data from the last week. If you need historical data older than seven days, check out the premium and enterprise search APIs.

Standard search operators

Operator                            Finds Tweets...
watching now                        containing both “watching” and “now”. This is the default operator.
“happy hour”                        containing the exact phrase “happy hour”.
love OR hate                        containing either “love” or “hate” (or both).
beer -root                          containing “beer” but not “root”.
#haiku                              containing the hashtag “haiku”.
from:interior                       sent from Twitter account “interior”.
list:NASA/astronauts-in-space-now   sent from a Twitter account in the NASA list astronauts-in-space-now.
to:NASA                             a Tweet authored in reply to Twitter account “NASA”.
@NASA                               mentioning Twitter account “NASA”.
politics filter:safe                containing “politics” with Tweets marked as potentially sensitive removed.
puppy filter:media                  containing “puppy” and an image or video.
puppy -filter:retweets              containing “puppy”, filtering out retweets.
puppy filter:native_video           containing “puppy” and an uploaded video, Amplify video, Periscope, or Vine.
puppy filter:periscope              containing “puppy” and a Periscope video URL.
puppy filter:vine                   containing “puppy” and a Vine.
puppy filter:images                 containing “puppy” and links identified as photos, including third parties such as Instagram.
puppy filter:twimg                  containing “puppy” and a pic.twitter.com link representing one or more photos.
hilarious filter:links              containing “hilarious” and linking to URL.
puppy url:amazon                    containing “puppy” and a URL with the word “amazon” anywhere within it.
superhero since:2015-12-21          containing “superhero” and sent since date “2015-12-21” (year-month-day).
puppy until:2015-12-21              containing “puppy” and sent before the date “2015-12-21”.
movie -scary :)                     containing “movie”, but not “scary”, and with a positive attitude.
flight :(                           containing “flight” and with a negative attitude.
traffic ?                           containing “traffic” and asking a question.

Note: queries must be URL-encoded (an online converter can be used for this).
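
The encoding can also be done locally rather than with an online converter. The sketch below uses `urllib.parse.quote` from the Python standard library to percent-encode a query built from the search operators; `encode_query` is a hypothetical helper name.

```python
from urllib.parse import quote


def encode_query(query):
    """Percent-encode a search query so it can be used as the q= parameter."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    return quote(query, safe="")


for q in ["@twitterdev", "#haiku", "puppy -filter:retweets"]:
    print("https://api.twitter.com/1.1/search/tweets.json?q=" + encode_query(q))
```

The first query comes out as `q=%40twitterdev`, matching the URL in the example above.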








