License: arXiv.org perpetual non-exclusive license
arXiv:2312.11531v1 [q-fin.ST] 14 Dec 2023

The irruption of cryptocurrencies into Twitter cashtags: a classifying solution

Ana Fernández Vilas
AtlantTTic Research Center, University of Vigo, Spain. Email: avilas@det.uvigo.es
Rebeca Díaz Redondo
AtlantTTic Research Center, University of Vigo, Spain. Email: rebeca@det.uvigo.es
Antón Lorenzo García
AtlantTTic Research Center, University of Vigo, Spain
Abstract

There is a consensus about the good sensing characteristics of Twitter for mining and uncovering knowledge in financial markets; it is considered a relevant source for decisions about buying or holding stock shares and even for detecting stock manipulation. Although Twitter hashtags allow topic-related content to be aggregated, a specific mechanism for financial information also exists: the cashtag (the company ticker preceded by $), which supports tracking financial tweets that refer to a company listed in a stock market. However, according to our experiments, and due to the lack of conventions in cashtag usage, the irruption of cryptocurrencies has significantly degraded the cashtag-based aggregation of posts. Twitter users may use homonym tickers to refer both to cryptocurrencies and to companies in stock markets, which means that filtering by cashtag may return posts referring to stock companies as well as posts referring to cryptocurrencies. This research proposes automated classifiers to distinguish conflicting cashtags and, therefore, the tweets that contain them, by analyzing the distinctive features of tweets referring to stock companies and to cryptocurrencies. As an experiment, this paper analyses the interference between cryptocurrencies and company tickers in the London Stock Exchange (LSE), specifically companies in the main and alternative market indices FTSE-100 and AIM-100. Heuristic-based as well as supervised classifiers are proposed, and their advantages and drawbacks, including their ability to self-adapt to changes in Twitter usage, are discussed. The experiment confirms a significant distortion in the collected data when colliding or homonym cashtags exist, i.e., when the same $ acronym refers to both a company ticker and a cryptocurrency.
According to our results, the distinctive features of posts including cryptocurrencies or company tickers support accurate classification of colliding tweets (homonym cashtags), and Independent Models, as the classifiers most detached from the training data, have the potential to be applied across different stock markets while retaining performance.

Index Terms:
AIM-100, Cashtags, Cryptocurrencies, Data Analysis, FTSE-100, London Stock Exchange, Support Vector Machines, Twitter

I Introduction

The growing irruption of information and telecommunications technologies into the stock markets has been accompanied by business growth. The flood of information can support decisions taken by brokers as well as by individual investors, who harvest information about the situation of a company, clients’ opinions, socio-economic changes, political decisions, rumors, etc. Moreover, the ubiquity of online social media has also reached companies, brokers and other key players in the financial market, who have begun to share more and more financial information and expert opinions on stock exchanges. Although its real-time nature might turn social media into one of the main sources, if not the main one, for decision-making, success in stock trading depends not only on quick access to information but also on the quality of this information.

Currently, Twitter is one of the most used platforms for sharing financial information from companies, brokers, news agencies or individual investors. Compared with other financial information sources such as message boards or discussion forums, stock microblogging exhibits three distinctive characteristics [Sprenger et al.(2014)Sprenger, Tumasjan, Sandner and Welpe]: (1) Twitter’s public timeline may capture the natural market conversation more accurately and reflect up-to-date developments; (2) Twitter supports a more ticker-like live conversation, which exposes Twitter microbloggers to the most recent information on all stocks and does not require users to actively enter the forum for a particular stock; and (3) Twitter microbloggers should have a stronger incentive to publish valuable information in order to maintain their reputation (increasing mentions, retweet rate and followers), while financial bloggers can be indifferent to their reputation in the forum. The combination of this stream of information with suitable processing and analysis techniques would support the actions of many financial stakeholders and even law enforcement agencies.

The main Twitter mechanism for tracking financial information is the cashtag: a clickable term consisting of a company ticker preceded by the $ symbol. Recall that a company ticker is a short sequence of letters, and sometimes numbers, that uniquely identifies a company in a specific stock market. For example, the ticker of Vodafone in the LSE (London Stock Exchange) is VOD, so its cashtag is $VOD. The cashtag is included in the tweet’s text, as hashtags are, and using $VOD is supposed to mark the tweet as a post containing financial information about Vodafone. Also as with hashtags, Twitter supports tracking tweets that contain a specific cashtag. All of this makes cashtags one of the most useful mechanisms for easily harvesting financial information on Twitter. However, the irruption of cryptocurrencies has degraded the accuracy, and hence the quality, of the information obtained through cashtags, because some cryptocurrency acronyms are identical to company tickers in stock markets (acronym conflict), largely due to the huge number of cryptocurrencies and the lack of a regulated cryptocurrency market. As a result, when a conflicting or colliding cashtag is used for tracking or searching, results referring to both stock companies and cryptocurrencies might be retrieved. Moreover, (i) the total number of cryptocurrency-related tweets far surpasses that of company-related tweets (a consequence of the increasing popularity of cryptocurrencies) and (ii) cryptocurrency tweets are considered low quality, since most of them are spam or auto-generated messages. All of this significantly degrades the tracking capacity of cashtags and underlines the need for disambiguation mechanisms to distinguish the two groups of tweets.
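As an illustration of how cashtags can be pulled out of a tweet body, the following is a minimal sketch. The pattern is an assumption made for this example and is not Twitter's official tokenizer: it accepts "$" followed by a letter and up to nine further letters or digits, so dotted tickers such as $BT.A would be truncated to $BT and would need an extended pattern. The function name `extract_cashtags` is hypothetical.

```python
import re

# Assumed pattern (not Twitter's official rule): "$", a letter, then up to
# nine letters or digits; the "$" must not be glued to a preceding word or "$".
CASHTAG_RE = re.compile(
    r"(?<![A-Za-z0-9_$])\$([A-Za-z][A-Za-z0-9]{0,9})(?![A-Za-z0-9_])"
)


def extract_cashtags(text: str) -> list[str]:
    """Return the cashtags of a tweet body, upper-cased, in order of first use."""
    found: list[str] = []
    for match in CASHTAG_RE.finditer(text):
        tag = "$" + match.group(1).upper()
        if tag not in found:  # keep each cashtag once
            found.append(tag)
    return found
```

Amounts such as "$100" are ignored because the character after "$" must be a letter, and a cashtag written in lower case is normalized to upper case so that it can be compared against a ticker list.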

This paper aims to highlight the issue of conflicting cashtags and to propose classifiers that distinguish cryptocurrency and company cashtags with sufficient accuracy. To show the challenge and introduce the classifiers, the research work was carried out on real data retrieved from Twitter and related to the London Stock Exchange (LSE). The experiment in this paper focuses on cashtag conflicts in LSE companies, specifically the 100 companies listed in the main market index FTSE-100 and the 100 companies listed in the alternative market index AIM-100, both between July 1, 2017 and February 15, 2018. Hereinafter, we refer to both indices together as LSE-100.

This paper is structured as follows. After the related work is introduced in Section II, Section III describes the motivation, the experimental dataset and an exploratory analysis that quantifies the impact of cryptocurrency tweets on the LSE-100 tweets. The datasets and a bird’s-eye description of the deployed methodology are introduced in Sections IV and V. Then, Section VI details the exploratory analysis, a feature extraction process that uncovers distinctive characteristics of LSE-100 tweets (tweets that only refer to a company listed in LSE-100) and cryptocurrency tweets (tweets that only refer to a cryptocurrency). Following this methodology, the proposed classifying systems, which solve the problem of colliding cashtags, are progressively introduced: Word-based Heuristic Filters (Section VII); SVM (Support Vector Machine) Classifiers (Section VIII); Combined Classifiers (Section IX); LSTM (Long Short-Term Memory) network classifiers (Section X); and Logistic-regression-based Classifiers (Section XI). The advantages and drawbacks of the proposed classifiers, as well as their ability to self-adapt to changes in Twitter usage, are discussed in Section XII and, given the generalizable features of the independent versions of the classifiers, Section XIII applies a statistical test to evaluate whether the differences in performance are due to differences between the models. Finally, the main conclusions and future work are summarized in Section XIV.

II Related Work

The modernization and digitalization of the financial market has been accompanied by a remarkable increase in the online information available to both brokers and individual investors, especially in social media, which constitutes a huge source of knowledge that may be applied to analyze and even predict financial movements, and so to assist decision-making in financial markets. Several works have addressed the predictive power of the financial information in social media. [Bordino et al.(2012)Bordino, Battiston, Caldarelli, Cristelli, Ukkonen and Weber] show that the trading volumes of stocks listed in NASDAQ-100 are correlated with their query volumes, i.e., the number of requests submitted on the Internet, and [Wang et al.(2012)Wang, Huang and Wang] proposed sentiment analysis as one of the most relevant features for improving the accuracy of financial time-series forecasting. More recently, [Cavalcante et al.(2016)Cavalcante, Brasileiro, Souza, Nobrega and Oliveira] and [Li et al.(2017)Li, Dai, Park and Park] raised similar ideas, such as the importance of mining textual context and analyzing the sentiment of professional opinions in social media and financial news as useful supplementary sources for forecasts; and [Pai and Liu(2018)] apply regression and time series models to social media data (Twitter) and stock market values to predict monthly total vehicle sales.

In addition, several intelligent trading systems have been proposed, such as [Gunduz and Cataltepe(2015)], who deployed a forecasting method that combines the analysis of news from Turkish finance websites, the extraction of feature vectors and stock prices to predict future market movements; or [Nassirtoussi et al.(2015)Nassirtoussi, Aghabozorgi, Wah and Ngo], who applied text mining techniques to financial news headlines to predict movements in the FOREX market. Deep learning has also been applied to model both short-term and long-term influences of events on stock price movements in [Ding et al.(2015)Ding, Zhang, Liu and Duan]. Finally, [Ranco et al.(2016)Ranco, Bordino, Bormetti, Caldarelli, Lillo and Treccani] explored the combination of public news with the browsing activity of Yahoo! Finance users to forecast intra-day and daily price changes of a set of 100 highly capitalized US stocks. To sum up, most of the research in this field highlights the predictive power of social media, especially when combined with other information sources. In [Zhang et al.(2018)Zhang, Qu, Huang, Fang and Yu], the idea that stock markets are affected by various factors (trading volume, news events and investors’ emotions) is supported via multi-source multiple instance learning applied to a three-part dataset: historical quantitative data, Web news articles, and investor social network posts (all located in China).

While the impact of information from online data sources on the financial market is widely acknowledged by researchers and professionals, there is also a broad consensus about Twitter specifically. Besides the well-known hashtag, which allows people to follow topics they are interested in, Twitter introduced a clicking and tracking feature for tickers (companies’ stock symbols) known as cashtags, as explained before. Cashtags allow tracking financial information about a specific company or market. Regarding this mechanism, [Hentschel and Alonso(2014)] reported an exploratory analysis of public tweets in English that contain at least one cashtag from NASDAQ (National Association of Securities Dealers Automated Quotation) or NYSE (New York Stock Exchange). The study concludes that the use of cashtags is higher in the technology sector, which seems to be related to the technological profile of most Twitter users. It also highlights the existence of relevant information behind the co-occurrence of cashtags and the co-occurrence of cashtags and hashtags together.

It is worth mentioning that, according to [Dredze et al.(2016)Dredze, Kambadur, Kazantsev, Mann and Osborne], there are mainly five types of users that post financial information on Twitter (journalists, companies and their representatives, investors, government agencies and citizen journalists) and that, unlike classic information sources, which consist mainly of breaking news, Twitter financial information also includes rumors and speculation. In [Ceccarelli et al.(2016)Ceccarelli, Nidito and Osborne], the authors agree on the usefulness of Twitter as a source of financial information and on its complementary value to traditional information sources, but they suggest that popularity on Twitter does not necessarily work the same way for financial contributors as for contributors in other areas, so that the importance of novelty and popularity is higher in financial tweets. Moreover, [Elliott et al.(2017)Elliott, Grant and Rennekamp] show that the impact of negative financial news on investors also depends on its origin. When the negative news comes from the Twitter account of the Investor Relations Office, the investors’ willingness to invest decreases sharply, but when the negative news comes from the CEO’s Twitter account, it has no effect on that willingness.

[Rao and Srivastava(2014)] studied the relationship between Twitter sentiment and financial market measures such as volatility and trading volume, with promising results for the Dow Jones Industrial Average (DJIA) and NASDAQ-100 indices in high-frequency trading. This type of algorithmic trading (high speed, high turnover rate) demands real-time financial data and electronic trading tools, and social media in particular could be integrated as a fast mechanism for capturing public behavior and opinion to support decisions. More recently, [Fernández Vilas et al.(2019)Fernández Vilas, Díaz Redondo, Crockett, Owda and Evans] analyzed the variability in the volume, content, sentiment and geographical provenance of posts after a far-reaching financial event and concluded that, although Twitter is not a special-purpose financial forum, it is highly permeable to financial events. Thus, the sentiment of posts changes with the financial event considered, so Twitter activity can be a good predictor or sign of the state of the stock market. However, not everything related to Twitter is good news. According to [Dredze et al.(2016)Dredze, Kambadur, Kazantsev, Mann and Osborne], the use of Twitter brings new challenges, such as the huge volume of available data, the high level of repetition of the same information and the quality of the tweets.

Given all the above, the relationship between Twitter behavior and stock share price is, by far, the most studied scenario, especially at relevant moments such as quarterly announcements. [Ranco et al.(2015)Ranco, Aleksovski, Caldarelli, Grčar and Mozetič] investigated a 15-month period of Twitter data about 30 stock companies of the DJIA index and, according to the results, not only is there a strong correlation between Twitter behavior and stock share price at well-known relevant moments, but there are also correlation peaks that do not correspond to any expected news about the stock market. Moreover, in [Liu et al.(2015)Liu, Wu, Li and Li], Twitter is used to identify and predict stock co-movement from firm-specific social media metrics. Aside from causality between Twitter and stock prices, [Shutes et al.(2016)Shutes, McGrath, Lis and Riegler] study US market tweets as signs of new information in the stock market, and their experiment shows that nearly a third of the tweets are linked to abnormal price movements. However, the lack of information during regular periods makes it difficult for Twitter to completely replace traditional information sources for the financial market.

Leaving aside the relation between Twitter and financial markets, other studies have examined the predictive value of the information extracted from Twitter for trading decisions. In [Ruiz et al.(2012)Ruiz, Hristidis, Castillo, Gionis and Jaimes], the correlation between Twitter activity and financial time series showed that stock share prices are weakly correlated with the analyzed Twitter features if they are used alone. In addition, [Cazzoli et al.(2016)Cazzoli, Sharma, Treccani and Lillo] analyzed over 1700 listed companies for more than two years. Apart from underlining the importance of obtaining a huge financial tweet dataset, the authors found that expert users affect the financial market more than others and that the technology and consumer sectors show a better correlation than other sectors.

Even though most of the research work focuses on Twitter data volume, like the studies introduced above, some studies also apply sentiment analysis to determine the polarity of Twitter content and its impact on the financial market. [Nassirtoussi et al.(2011)Nassirtoussi, Aghabozorgi, Wah and Ngo] showed that public mood analyzed through Twitter feeds is correlated with the DJIA. Also, [Zhang(2013)] found a high negative correlation between mood states such as hope, fear and worry in tweets and the DJIA. Furthermore, in [Al Nasseri et al.(2014)Al Nasseri, Tucker and de Cesare], [Liew and Budavári(2016)], [Rajesh and Gandy(2016)], [Cortez et al.(2016)Cortez, Oliveira and Ferreira] and [Nguyen et al.(2015)Nguyen, Shirai and Velcin], sentiment analysis was considered useful for making trading decisions or predicting stock market variables. More recently, [Pagolu et al.(2016)Pagolu, Reddy, Panda and Majhi] applied sentiment analysis and unsupervised machine learning to analyze the correlation between the stock market movements of a company and the sentiments in tweets, finding a strong correlation between the rises and falls in the stock prices of a company and the public opinions or emotions about that company expressed on Twitter. Similarly, [Dickinson and Hu(2015)] investigated the Pearson correlation of public sentiment with stock increases and decreases. In turn, [Oliveira et al.(2016)Oliveira, Cortez and Areal, Oliveira et al.(2017)Oliveira, Cortez and Areal] proposed stock market lexicons to deal with the short length of tweets, one of the main issues of natural language techniques when they are applied to tweets. They then studied the correlation between investors’ sentiment indicators and two traditional survey-based indicators, II (Investors Intelligence) and AAII (American Association of Individual Investors), with moderate correlation results.

To sum up, there is a consensus about the good sensing and novelty characteristics of Twitter as a source of information for the financial market, especially if it is combined with other information sources. As most current research is focused on the predictive power of Twitter and on its capability to support decision making, it is now especially important to retrieve information of sufficient quality to support these envisaged expert systems for the financial market. In this respect, this paper aims to support quality in financial data retrieval. This research work highlights the negative effect of the popularity of cryptocurrencies on the sensing capability of Twitter, and specifically on the efficiency of cashtags as a tracking mechanism for financial information due to, as mentioned before, the use of homonym cashtags to refer to both company tickers and cryptocurrencies.

III Motivation

Since 2012, Twitter has incorporated the cashtag as a mechanism to find and track tweets that address companies by their tickers in a specific stock market. However, its usefulness has deteriorated due to the interference of cryptocurrencies. Although cryptocurrencies have existed for a long time, they became remarkably popular in late 2017, as shown in Figures 1(a), 1(b) and 1(c), which represent the volume of Google searches for the term "cryptocurrency" and for two specific cryptocurrencies, Nxt and Stellar Lumens. In the same period, the volume of Google searches about specific stock companies remained constant (Figure 1(d)).

(a) Cryptocurrency (general term): Google Trends search evolution
(b) NXT (ticker of the Nxt platform, cryptocurrency): Google Trends search evolution
(c) XLM (ticker of Stellar Lumens, cryptocurrency): Google Trends search evolution
(d) VOD (ticker of the Vodafone company in LSE): Google Trends search evolution
Figure 1: Google Trends search evolution, late 2017

This change in behavior is also visible on Twitter, where the number of daily results about cryptocurrencies has increased more than 40-fold, according to our analysis. Although these tweets should not interfere with the cashtag mechanism, many of them use the dollar symbol $ followed by the acronym of the cryptocurrency to indicate that the tweet refers to it. The conflict arises when the ticker of a company and the acronym of a cryptocurrency match, which is a natural consequence of the huge number of cryptocurrencies that have emerged in a short time. As a result, when tweets with a specific cashtag are retrieved, it may happen that most of them do not refer to the company they should identify but to the coincident cryptocurrency instead. This is the case, for example, for $XLM (XLMedia company vs. Stellar Lumens cryptocurrency) and $NXT (Next plc company vs. Nxt platform cryptocurrency). We refer to these colliding cashtags as homonym cashtags and to the tweets that contain at least one of them as homonym tweets.

As mentioned, this paper studies the negative effect of homonym cashtags using LSE-100 as the stock exchange under study. Thus, a homonym cashtag is any cashtag that can refer to both an LSE company (LSE-100, restricted to companies in FTSE-100 and AIM-100) and a cryptocurrency because both have the same acronym, and a homonym tweet is any tweet that has at least one homonym cashtag. The list of homonym tickers in LSE-100 can be seen in Table I. These tickers were identified manually, by looking for coincident cryptocurrencies for each constituent company. On the other hand, we call a non-homonym cashtag any cashtag that refers only to a regulated stock market company or only to a cryptocurrency, because no two of them share the same ticker, and a non-homonym tweet any tweet that has at least one cashtag from an LSE-100 company, as long as none of its cashtags is included in the list of homonym cashtags. We consider a company tweet to be any tweet that contains at least one cashtag that refers to a company in a stock market, and a cryptocurrency tweet to be any tweet that contains at least one cashtag that refers to a cryptocurrency.

Figure 2: FTSE-100 and AIM-100 tweets: time distribution (July 2017-February 2018)

| Homonym cashtag | LSE-100 company (market) | Cryptocurrency |
|---|---|---|
| $NXT | Next plc (FTSE-100) | Nxt (coin and platform) |
| $SKY | SKY plc (FTSE-100) | Skycoin |
| $XLM | XLMEDIA (AIM-100) | Stellar |
| $BRK | BROOKS (AIM-100) | Breakout coin |
| $GBG | GB group (AIM-100) | Golos Gold |
| $APH | Alliance pharma (AIM-100) | Aphroditecoin |
| $AMS | Advanced medical solutions (AIM-100) | Amsterdamcoin |
| $CRW | Craneware (AIM-100) | Crown |

TABLE I: Homonym tickers in LSE (LSE-100, restricted to FTSE-100 and AIM-100)

The number of tweets that contain an FTSE-100 or an AIM-100 cashtag increased sharply in late 2017 (Figure 2). However, most of these tweets do not refer to LSE-100 companies. Recall that the FTSE-100 index lists the one hundred most valuable companies, such as Vodafone, Coca-Cola or Rio Tinto, while the AIM-100 lists the one hundred most valuable companies in the secondary market; these companies (e.g., Alliance Pharma, Hutchison China or Staffline) are less well known than the main market companies. This interference is even more significant if we take into account the disparity in the number of results obtained. While FTSE-100 non-homonym tickers yield up to 1,000 results daily, the tickers XLM and NXT (cryptocurrency-colliding tickers) yield more than 10,000 daily results (Figure 3).
Apart from that, as shown in Figure 3, the interference of homonym tweets has grown more than 30-fold since October 2017. From then on, the retrieved homonym tweets made up practically all the results obtained. In fact, the number of homonym tweets retrieved in December is 5.6 times the number of non-homonym tweets collected for the FTSE-100 market and up to 40 times for the AIM-100.

Figure 3: Daily FTSE-100 and AIM-100 tweet time distribution, Homonym(black) vs No Homonym(blue) (July 2017-February 2018)

To explore the interference of cryptocurrencies in the regulated market, all the homonym tweets were manually classified as cryptocurrency tweets or LSE-100 tweets, taking into account the content, the user characteristics and history, etc. The results of the annotation are shown in Figure 4, where we can see that the increase in homonym tweets is mostly concentrated in cryptocurrency tweets, while the number of LSE-100 tweets remains constant, with a ratio of 100:1 at the beginning of 2018.

Figure 4: Homonym tweets time distribution, LSE-100(blue) vs Cryptocurrency(black) (July 2017-February 2018)

This large number of cryptocurrency tweets underlines the need for supporting methods to properly retrieve information about stock exchanges. Although the situation varies slightly from one homonym cashtag to another, it is clearly difficult to track the stock exchange via Twitter using cashtags alone. In addition, almost all the tweets about cryptocurrencies are spam or auto-generated by applications. For this reason, the informative purpose of the cashtag is almost lost, so disambiguation mechanisms need to be developed.

IV Datasets

For this study, three different datasets were used, defined by the cashtags they include. The cashtags were selected to provide a representative set of non-homonym cryptocurrency cashtags, non-homonym LSE-100 cashtags (FTSE-100, AIM-100) and homonym cashtags:

  • Non-homonym Cryptocurrencies tweets: a set of tweets that contain at least one cashtag of a non-homonym cryptocurrency, that is, one that does not coincide with any ticker in FTSE-100 or AIM-100.

  • Non-homonym LSE-100 tweets: a set of tweets that have at least one cashtag from an LSE-100 company and do not contain any homonym cashtag, split into two subsets (FTSE-100 and AIM-100). Keep in mind that a tweet can be in both subsets if it has a cashtag from FTSE-100 and a cashtag from AIM-100. Likewise, if the tweet also has a non-homonym cryptocurrency cashtag, it could also belong to the Non-homonym Cryptocurrencies tweets.

  • Homonym tweets: a set of tweets that contain at least one homonym cashtag, which refers both to a company and to a cryptocurrency. It can also be split into two subsets, FTSE-100 and AIM-100, according to the list in which the colliding cashtag is included.
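The membership rules in the list above can be sketched as a small routing function. This is one possible reading of the definitions and not the authors' collection code: the homonym sets come from Table I, the other sets are short excerpts of Table II, the dataset names are those of Table III, and the function name `assign_datasets` is hypothetical. Collection dates are ignored here.

```python
# Homonym cashtags from Table I, split by the index that lists the company.
HOMONYM_FTSE = {"$NXT", "$SKY"}
HOMONYM_AIM = {"$XLM", "$BRK", "$GBG", "$APH", "$AMS", "$CRW"}

# Short excerpts of the non-homonym sets in Table II (not the full lists).
FTSE = {"$VOD", "$BARC", "$GSK"}
AIM = {"$BOO", "$IQE", "$HUR"}
CRYPTO = {"$BTC", "$ETH", "$XRP"}


def assign_datasets(cashtags: set[str]) -> set[str]:
    """Return the names of every dataset a tweet with these cashtags falls into."""
    tags = {tag.upper() for tag in cashtags}
    datasets: set[str] = set()

    # Homonym tweets: at least one colliding cashtag, split by index.
    if tags & HOMONYM_FTSE:
        datasets.add("FTHDS")
    if tags & HOMONYM_AIM:
        datasets.add("AMHDS")

    # Non-homonym LSE-100 tweets: a company cashtag and no homonym cashtag.
    if not tags & (HOMONYM_FTSE | HOMONYM_AIM):
        if tags & FTSE:
            datasets.add("FTNHDS")
        if tags & AIM:
            datasets.add("AMNHDS")

    # Non-homonym cryptocurrency tweets: any non-homonym cryptocurrency cashtag.
    if tags & CRYPTO:
        datasets.add("CNHDS")
    return datasets
```

A tweet carrying both $VOD and $BOO lands in the two non-homonym subsets at once, which mirrors the remark that the FTSE-100 and AIM-100 subsets may overlap.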

Cryptocurrencies

$SNT, $ADA, $MTH, $ADX, $LSK, $DSR, $ARK, $CLOAK, $TKN, $DLC, $DCR, $KMD, $IQT, $ZCL, $DCY, $ALIS, $RBY, $SYS, $EXP, $BCY, $VEN, $BCN, $BLITZ, $UGT, $GVT, $MONA, $QASH, $DASH, $AUR, $UNO, $BURST, $REQ, $PART, $TRIG, $GCR, $LMC, $XEM, $BNB, $SNGLS, $BITSILVER, $PDC, $ELIX, $XVG, $DOPE, $LEND, $SNRG, $NLG, $ARDR, $QSP, $SALT, $SYNX, $GRC, $XDN, $PIVX, $DCT, $WAVES, $PTOY, $SIB, $LTC, $CPC, $NAS, $XMR, $LOCI, $ION, $VSX, $NXS, $XMY, $GBYTE, $XMG, $BAT, $IOP, $HMQ, $NTCC, $PKB, $BAY, $PBL, $BYC, $MINT, $HSR, $MUSIC, $XSPEC, $IGNIS, $ETP, $BWK, $FCT, $DRGN, $MUE, $XPM, $STEEM, $FTC, $SPHR, $DGB, $DGD, $SUB, $VOX, $MAID, $RPX, $AEON, $XAUR, $MIOTA, $CRC, $BET, $ENG, $XVJ, $POWR, $STORJ, $GUP, $UBQ, $SBD, $INFX, $LGD, $DYN, $INFR, $ONION, $MANA, $SLR, $FUN, $CURE, $BITB, $EMC2, $XZC, $IOTA, $COVAL, $AGRS, $PASC, $DOGE, $XRB, $SWT, $FLDC, $ZEC, $NBT, $XRP, $ETH, $RADS, $ETC, $PANGEA, $CLAM, $PHR, $APX, $BTC, $NEM, $NEO, $MYST, $START, $ENJ, $WTC, $PPT, $STR, $ARDOR, $ITZ, $BCPT, $ITC, $TAAS, $STRAT, $SEQ, $EDG

FTSE-100

$CPI, $DC., $HIK, $INTU, $SN., $CPG, $CCL, $BARC, $CCH, $GSK, $BDEV, $DCC, $BLND, $RIO, $WTB, $SMIN, $IAG, $MRW, $SVT, $III, $ITRK, $AHT, $JMAT, $IHG, $LGEN, $HL., $AV., $BATS, $STAN, $CRH, $LSE, $RTO, $SGRO, $SBRY, $CRDA, $SHP, $DLG, $BLT, $PSON, $GKN, $GLEN, $PSN, $NG., $SSE, $INF, $SMT, $BNZL, $UU., $MERL, $REL, $PRU, $LAND, $FERG, $DGE, $MDC, $WPP, $MCRO, $EXPN, $WPG, $RRS, $VOD, $RMG, $RR., $IMB, $RDSB, $RDSA, $HMSO, $FRES, $ADM, $TSCO, $PFG, $HSBA, $SKG, $OML, $TUI, $ITV, $MKS, $ULVR, $AZN, $AAL, $BT.A, $BAB, $PPB, $BRBY, $MNDI, $RB., $SL., $LLOY, $SGE, $ABF, $RBS, $STJ, $ANTO, $CNA, $SDR, $GFS, $TW., $RSA, $BP., $CTEC, $EZJ, $KGF, $BA.

AIM-100

$OPG, $SQS, $PAF, $BOO, $GHH, $TAP, $MANX, $SAA, $VNL, $KWS, $IOM, $PLUS, $HZD, $ARBB, $BNN, $IPEL, $CVSG, $SFE, $OCI, $CRS, $DTG, $STAF, $FDP, $ABC, $XSG, $SCH, $BUR, $BMK, $APGN, $TCM, $HUR, $NFC, $IGR, $SOLG, $YNGN, $FOG, $QXT, $REDD, $YNGA, $HCM, $MPE, $BREE, $FEVR, $RNWH, $RWS, $WINE, $POLR, $DOTD, $SMS, $TEF, $GAMA, $CLIN, $MUL, $CTH, $TMO, $JSG, $CAM, $ASY, $QTX, $ASC, $NUM, $CAKE, $EMIS, $LTG, $SMTG, $MAB1, $VTU, $JHD, $CVR, $PRSM, $IQE, $AMER, $EAH, $WJG, $ACSO, $SOU, $CAML, $JOUL, $PANR, $RTHM, $FPM, $MTW, $VCP, $HGM, $DDDD, $SLE, $YOU, $PURP, $IDOX, $OGN, $HOTC, $NICL, $RST, $MIDW, $TFW, $SCPA, $SPH

TABLE II: Non-homonym Cashtags for the datasets

Table II shows the three cashtag sets. The set of non-homonym cryptocurrencies was obtained from web sites devoted to tracking cryptocurrencies. The set of homonym tickers is shown in Table I and constitutes the main scenario analyzed in this paper: the incidence of new tweets related to cryptocurrencies on the tweets referring to a company in a stock exchange (LSE-100 in our case). As mentioned, tweets were manually annotated (cryptocurrency or company) by experts, who carefully analyzed their content, the user’s characteristics and any additional information added to the tweet through a hyperlink. A schematic description of the datasets is shown in Table III.

| Name | Data interval | Description | Number of results |
|---|---|---|---|
| Cryptocurrencies non-homonym tweets (CNHDS) | 15 Jan 2018 to 15 Feb 2018 | Tweets that contain a cashtag of one of the main cryptocurrencies | 1,023,232 |
| FTSE-100 homonym tweets (FTHDS) | 1 Jul 2017 to 15 Feb 2018 | Tweets that contain a cashtag of companies of the FTSE-100 coincident with a cryptocurrency | 292,864 |
| FTSE-100 non-homonym tweets (FTNHDS) | 1 Jul 2017 to 15 Feb 2018 | Tweets that contain a cashtag of companies of the FTSE-100 that do not coincide with a cryptocurrency | 144,787 |
| AIM-100 homonym tweets (AMHDS) | 1 Jul 2017 to 15 Feb 2018 | Tweets that contain a cashtag of companies of the AIM-100 coincident with a cryptocurrency | 405,625 |
| AIM-100 non-homonym tweets (AMNHDS) | 1 Jul 2017 to 15 Feb 2018 | Tweets that contain a cashtag of companies of the AIM-100 that do not coincide with a cryptocurrency | 69,138 |

TABLE III: Overview of the datasets
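To make the scale of the interference concrete, the counts in Table III can be turned into whole-period ratios. The sketch below is plain arithmetic over the table values; the helper names are hypothetical. These figures cover July 2017 to February 2018 as a whole, so they are not expected to match the December ratios quoted in Section III.

```python
# Tweet counts copied from Table III.
COUNTS = {
    "CNHDS": 1_023_232,
    "FTHDS": 292_864,
    "FTNHDS": 144_787,
    "AMHDS": 405_625,
    "AMNHDS": 69_138,
}


def homonym_ratio(homonym: int, non_homonym: int) -> float:
    """Homonym tweets retrieved per non-homonym tweet."""
    return homonym / non_homonym


def homonym_share(homonym: int, non_homonym: int) -> float:
    """Fraction of all retrieved index tweets that carry a homonym cashtag."""
    return homonym / (homonym + non_homonym)


ftse_ratio = homonym_ratio(COUNTS["FTHDS"], COUNTS["FTNHDS"])
aim_ratio = homonym_ratio(COUNTS["AMHDS"], COUNTS["AMNHDS"])
ftse_share = homonym_share(COUNTS["FTHDS"], COUNTS["FTNHDS"])
aim_share = homonym_share(COUNTS["AMHDS"], COUNTS["AMNHDS"])
```

Over the whole collection period this gives roughly 2.0 homonym tweets per non-homonym tweet for FTSE-100 and roughly 5.9 for AIM-100, that is, about 67% and 85% of the retrieved tweets for each index.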

IV-A Tweet structure

For the purposes of the exploratory analysis, the information in a tweet can be divided into three main blocks: (i) general information about the tweet, such as the ID, the language, the number of retweets and favorites (the tweets were captured as soon as they were posted, so the values of retweets and favorites are 0) and, especially, the body of the tweet; (ii) the geolocation from which the tweet was sent; and (iii) user information such as name, description, followers, friends, number of favorite tweets, number of retweets, account location, language, whether it is a verified account, and the graphic representation of the account.
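The three blocks described above could be represented with a record such as the following sketch. The class and field names are assumptions chosen for illustration; they are neither the Twitter API schema nor the authors' storage format, and only a subset of the listed user fields is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInfo:
    """Block (iii): information about the posting account (partial)."""
    name: str
    description: str
    followers: int
    friends: int
    favourites: int
    verified: bool
    location: Optional[str] = None
    language: Optional[str] = None


@dataclass
class GeoInfo:
    """Block (ii): where the tweet was sent from, when available."""
    latitude: float
    longitude: float


@dataclass
class TweetRecord:
    """Block (i): general tweet information, plus links to the other blocks."""
    tweet_id: str
    language: str
    body: str
    user: UserInfo
    geo: Optional[GeoInfo] = None
    # Tweets are captured at posting time, so both counters start at 0.
    retweets: int = 0
    favorites: int = 0
```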

V Methodology

This paper proposes a reasoning methodology based on previously annotated examples, similar to CBR (Case-Based Reasoning [Yan Li, Shiu and Pal (2006)]). Classifiers based on CBR determine whether or not an object is a member of a class according to the examples in the case base. The extraction of a formal domain model of cashtag usage on Twitter is not required, so these classifiers require less effort in knowledge acquisition than, for instance, rule-based systems. To build a CBR classifier, first, feature selection and extraction are applied to the case base to remove non-informative features while preserving informative ones (features are extracted from Twitter user activity and post content in the case base). Second, those features are fed to a classifier that has been trained offline using machine learning techniques. The specific methodology is summarized in Figure 5 where, after obtaining the dataset and pre-processing it, manual annotation and feature identification are carried out. Manual annotation is used to obtain heuristic-based classifiers which distinguish cryptocurrency tweets from financial tweets about regulated markets; feature identification is used to decide on the informativeness of observational variables for the machine learning classifiers. Although we combine traditional feature management and machine learning, the main contribution of our work is applying them to disambiguation, that is, constructing a classifier that disambiguates in terms of features. The application field, in the context of the irruption of cryptocurrencies, can also be considered a novel contribution.

CNHDS was used to extract the common features of cryptocurrency tweets, while FTNHDS and AMNHDS were used to extract the common features of company tweets, so that they are not influenced by the issue of homonym tickers. In particular, the considered features were: (i) post content (preprocessed by discarding punctuation symbols, stop words, emoticons and URLs, and by applying lowercasing and stemming): most common terms, most common hashtags and number of tickers; (ii) user fields: number of followers, number of favorites, the account creation date, the date on which the default interface was edited, a short description of the account, etc.; and (iii) place and time of the post (weekday, time of day or geolocation).
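The content preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list, the regular expressions and the function name `extract_content_features` are hypothetical, and stemming is omitted to keep the sketch free of external libraries.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the paper does not give its list.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}

CASHTAG_RE = re.compile(r"\$[A-Za-z][A-Za-z0-9]*")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def extract_content_features(text):
    """Return the ticker count, hashtags and cleaned terms of a tweet body."""
    lowered = text.lower()
    no_urls = URL_RE.sub(" ", lowered)
    cashtags = CASHTAG_RE.findall(no_urls)
    hashtags = HASHTAG_RE.findall(no_urls)
    # Drop cashtags and hashtags, then keep alphabetic tokens only, which
    # also removes punctuation and emoticons. Stemming is not applied here.
    stripped = HASHTAG_RE.sub(" ", CASHTAG_RE.sub(" ", no_urls))
    terms = [t for t in re.findall(r"[a-z]+", stripped) if t not in STOP_WORDS]
    return {
        "n_tickers": len(set(cashtags)),  # number of different tickers
        "hashtags": hashtags,
        "terms": terms,
    }


features = extract_content_features("Buy $BTC and $ETH now! #bitcoin https://t.co/abc :)")
```

Counting the most common terms or hashtags per dataset would then be a matter of aggregating these per-tweet lists, for instance with `collections.Counter`.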

We studied the performance of all the proposed classifiers and also their combined performance by designing Combined Systems, which use the results of heuristic filtering as an independent variable for the supervised classifiers. Finally, in a more complex approach, we also studied classification performance using LSTM (Long Short-Term Memory) recurrent neural networks, which train an embedding matrix for the most common terms of the dataset (in this case, the 10,000 most common ones) that collects the relative importance of each term and the relationships between terms.

Figure 5: Block Diagram for the proposed method

To properly evaluate the proposed classification techniques, the homonym dataset has been randomly split into three subsets: train set (70%) to deploy classifiers; test set (30%) to measure their performance, and tune set (10% of the train set) to adjust the parameters of the classifiers. We used different performance measures: precision, recall, specificity, accuracy, F-score and AUC (Area Under the Curve). For instance, recall – without totally neglecting precision – is the key measure for heuristic filtering. However, the F-score and the AUC were the focus for supervised classifiers and combined systems. In addition to performance, complexity, estimated useful lifespan, updating tasks and usage scope were also evaluated.
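As a minimal sketch of this evaluation protocol, the code below splits a dataset and computes the threshold-based measures from a confusion matrix, taking company as the positive class (as in the "Precision (Company)" rows of the result tables). The function names are hypothetical, and holding the tune set out of the train set, rather than reusing it for training, is an assumption.

```python
import random


def split_dataset(items, seed=0):
    """70% train / 30% test; 10% of the train part is set aside for tuning."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(0.7 * len(shuffled)))
    train, test = shuffled[:n_train], shuffled[n_train:]
    n_tune = int(round(0.1 * len(train)))
    # Assumption: the tune set is removed from the data used for fitting.
    tune, train = train[:n_tune], train[n_tune:]
    return train, tune, test


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred, positive="company"):
    """Precision, recall, specificity, accuracy and F-score for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, len(pairs)),
        "f_score": _ratio(2 * precision * recall, precision + recall),
    }
```

Because company tweets are a small minority of the homonym data, accuracy and specificity stay high even for weak classifiers, which is why precision, recall and F-score are the more informative measures here.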

VI Tweet features

To identify the defining features from the general information in tweets, FTNHDS and AMNHDS were used as the reference for company tweets, and CNHDS as the reference for cryptocurrency tweets. Secondly, a user dataset was also built to take into account the characteristics of the user who posts (number of followers, number of favorites, account creation date or edition of the default interface, among others), both for cryptocurrency and company tweets. Also, as each user can have a short description on the account, common terms in these descriptions were pre-processed (similarly to tweet content) and considered part of the user profile. Thirdly, place (when available) and time were also considered. In the following sections, although all the tweet fields were observed, only those features which are distinctive according to the type of tweet (company or cryptocurrency) are described.

VI-A Content-based features

Most common terms is the most distinctive feature we can extract from the tweet content, since these terms vary between cryptocurrency tweets and company tweets, so that the presence of specific terms can determine the category membership with high likelihood. Figure 6 visualizes the feature most common terms as a tag cloud. The figures show that: (1) terms like coin, crypto, cryptocurrency, binanc, signal, fee or join are defining terms for cryptocurrency tweets, while rate, group, inc, plc, finance or company are defining terms for company tweets; (2) the usage of proper names of companies and cryptocurrencies is a defining criterion in both cases; and (3) there are also non-defining terms, mostly referring to market interactions (e.g. buy), which are considered poor signs of category membership. Most common hashtags can also be considered a distinctive feature in the exploratory analysis according to the expert annotation and, in fact, they are similar to the most common terms (see Figure 7): for instance, #bitcoin, #ethereum, #cryptocurrency, #altcoin, #airdrop or #binance for cryptocurrency tweets and #ftse, #mkt, #premarket or #earnings for company tweets. As with the most common terms, there are also hashtags common to the two datasets, e.g. #hold or #buy, although with very different percentages.

Figure 6: Cryptocurrency (blue) & Company (green) text word cloud
Figure 7: Cryptocurrency (blue) & Company (green) hashtag word cloud

Finally, the number of tickers per tweet is especially relevant. While the average number of tickers is 3 and the median is 1 for company tweets, the average and median rise to 18 and 20 respectively for cryptocurrency tweets (Figure 8). As outliers with a large number of company tickers exist, this feature should not be used in isolation.

Figure 8: Ticker distribution, LSE-100(blue) vs Cryptocurrency(black)

VI-B User information

The most common terms in the user description (Figure 9) are quite similar to those extracted from the main body: e.g., crypto and bitcoin are common for cryptocurrency tweets, while finance and company are common for company tweets. Moreover, the user description tends to contain less formal and more personal words (e.g., enthusiast or love). However, a relevant number of common terms, e.g. news or invest, are shared between users’ descriptions in cryptocurrency and company tweets, so the users’ descriptions are not as defining as the tweet content in terms of category membership.

Figure 9: Cryptocurrency (blue) & Company (green) user description word cloud

On the other hand, especially for LSE-100, the number of followers and friends was uncovered as a defining feature (Figures 10 and 11). Users linked to cryptocurrency tweets tend to have quite few followers (more than three quarters below 200 and most of them below 2; we interpret these numbers as a result of the existence of secondary accounts to disseminate self-generated tweets), whereas more than 75% of the users linked to company tweets have at least 100 followers. This defining feature also results in differences in the number of retweets and favorites, but they are not as remarkable as those in the number of followers and friends. However, even among users linked to cryptocurrency tweets, there are outlier users with millions of subscribers, which shows that the use of cashtags to refer to cryptocurrencies is a widespread phenomenon.

Figure 10: Follower distribution by user, LSE-100 (blue) vs Cryptocurrency (black)
Figure 11: Friend distribution by user, LSE-100 (blue) vs Cryptocurrency (black)

Finally, focusing on the account, its verified status, its change of profile and its creation time were considered relevant during the exploration. Regarding the verified status, although the percentage of verified users is not very high, verified users are more frequent for tweets about companies (1%) than for cryptocurrency accounts (0.1%), which can also be a result of the greater number of followers found in users linked to company tweets. Secondly, most accounts (72%) linked to cryptocurrency tweets have not changed the default profile (which is consistent with accounts intended not for personal use but for the diffusion of self-generated messages), whereas in the case of LSE-100 companies, for instance, this percentage falls to 58%. Thirdly, regarding the account creation time, the accounts linked to LSE-100 company tweets were created between 2009 and 2017, whereas cryptocurrency accounts are recent, from mid-2017 to the present, a period that coincides with the expansion of cryptocurrencies. However, the defining nature of creation time is reduced in the case of recent accounts, so it should be combined with other defining features.

VI-C Tweet time and place

During the exploratory analysis, the time of day with the highest number of tweets was found to be the most distinguishing feature (Figure 12). While LSE-100 tweets are regularly posted between 10:00 and 18:00 GMT, when the stock market is open, cryptocurrency tweets are spread more evenly throughout the day, as cryptocurrency markets do not have a closing time or a specific geographical area. Regarding the posting location, the percentage of geolocated tweets in the datasets is not significant enough to define a heuristic.

Figure 12: Tweet time distribution, LSE-100(blue) vs Cryptocurrency(black)

VI-D Main Features

After the exploratory analysis of cryptocurrency and company tweets which contain homonym cashtags, each type shows distinctive features that support automatic classification. These features are summarized in Table IV.

Feature | Cryptocurrency tweets | LSE-100 tweets
Body | Terms like crypto, coin, binanc or names of cryptocurrencies; many different tickers in the body | Terms like group, inc, plc, financ or names of markets; few tickers per body (one or two)
User | Few followers and friends; accounts created recently; informal description; verified users (0.1%) | Moderate followers and friends; accounts created from 2010 to now; formal description; verified users (1%)
Time and place | Posting during the whole day; no geographic information | Posting when the LSE is open; no geographic information

TABLE IV: Tweet main features

VII Heuristic Filters

A Simple word-based heuristic filter, based on the presence of certain key terms, was deployed to reduce as much as possible the number of misclassified company tweets. It uses terms that almost unmistakably identify cryptocurrency tweets, such as cryptocurrency, lumen, ethereum, bitcoin, blockchain or stellar, as well as the cashtags of cryptocurrencies whose acronyms do not collide with company tickers (Table V).

$SNT, $ADA, $MTH, $ADX, $LSK, $DSR, $ARK, $CLOAK, $TKN, $DLC, $DCR, $KMD, $IQT, $ZCL, $DCY, $ALIS, $RBY, $SYS, $EXP, $BCY, $VEN, $BCN, $BLITZ, $UGT, $GVT, $MONA, $QASH, $DASH, $AUR, $UNO, $BURST, $REQ, $PART, $TRIG, $GCR, $LMC, $XEM, $BNB, $SNGLS, $BITSILVER, $PDC, $ELIX, $XVG, $DOPE, $LEND, $SNRG, $NLG, $ARDR, $QSP, $SALT, $SYNX, $GRC, $XDN, $PIVX, $DCT, $WAVES, $PTOY, $SIB, $LTC, $CPC, $NAS, $XMR, $LOCI, $ION, $VSX, $NXS, $XMY, $GBYTE, $XMG, $IGNIS, $ETP, $BWK, $FCT, $DRGN, $MUE, $XPM, $STEEM, $FTC, $SPHR, $DGB, $DGD, $SUB, $VOX, $MAID, $RPX, $AEON, $XAUR, $MIOTA, $CRC, $BET, $ENG, $XVJ, $POWR, $STORJ, $GUP, $UBQ, $SBD, $INFX, $LGD, $DYN, $INFR, $ONION, $MANA, $SLR, $FUN, $CURE, $BITB, $EMC2, $XZC, $IOTA, $COVAL, $AGRS, $PASC, $DOGE, $XRB, $SWT, $FLDC, $ZEC, $NBT, $XRP, $ETH, $RADS, $ETC, $PANGEA, $CLAM, $PHR, $APX, $BTC, $NEM, $NEO, $MYST, $START, $ENJ, $WTC, $PPT, $STR, $ARDOR, $ITZ, $BCPT, $ITC, $TAAS, $STRAT, $SEQ, $EDG

TABLE V: Cryptocurrencies tickers used in Simple Word-based Heuristic filter
Heuristic filters
Basic Extended
Precision (Company) 0.551 0.609
Recall 0.932 0.999
Specificity 0.980 0.983
Accuracy 0.978 0.983
F-Score 0.692 0.757
TABLE VI: Word-based heuristics: Quality

Reference ticker | Word list
General Cryptocurrencies | coin, crypt, btc, lumen, ethereum, bitcoin, whale, stellar, binanc, blockchain
$NXT(LSE) | plc
$NXT(Crypto) | ignis, ardor, jelurida
$XLM(LSE) | xlmedia
$XLM(Crypto) | rocket, moon, $str, worth, now, trx
$CRW(LSE) | craneware
$APH(LSE) | weed, fire, emc, cannabis, medical, amphenol, aphria, $app, $acb
$BRK(LSE) | amz, aapl, twtr, berkshire, buffet, warren, brookline, brooks, oil
$SKY(LSE) | skyline, fox
$GBG(LSE) | plc, group
$AMS(LSE) | hospital, medical

TABLE VII: Words used in Extended Word-based Heuristic Filter

The quality results of the Simple word-based heuristic classifier are shown in the first column of Table VI. The precision, although not very high, is much higher than that of a null model (2.7%), so the filter is a good option to discard a large number of tweets about cryptocurrencies while losing only a limited fraction of tweets about companies. The terms used in the Simple word-based heuristic classifier are all specific names of the main current cryptocurrencies or words that refer to them, as is the case of blockchain or binance. Therefore, the performance of the filter should be maintained in the medium term and decline gradually as the trendy cryptocurrencies change. To avoid this decline, the list of cryptocurrencies should be updated periodically. As it uses a fixed list of cryptocurrencies, the filter should obtain similar results when working with tickers other than those studied. Although the precision and recall values obtained are significantly better than those of the null model, more than a thousand company tweets are misclassified, which falls short of the initial objective of the filter: achieving a practically perfect recall.

Although all the terms considered in the Simple word-based heuristic classifier refer directly to cryptocurrencies, some company tweets name cryptocurrencies even when the captured ticker does not refer to a cryptocurrency. This happens, for example, with $BRK, where various tweets refer to Berkshire Hathaway while talking about cryptocurrencies. This is the reason for most of the failures of the heuristic. To avoid this and improve the performance of the filter, it has been extended with a series of additional terms that depend on the ticker considered (see Table VII), in a filter referred to as the Extended word-based heuristic filter. This way, if for example the tweet under consideration contains the ticker $NXT and terms like Ignis or Ardor (elements related to the crypto platforms), the tweet will be classified as belonging to cryptocurrencies. However, if the ticker is $BRK and the tweet contains words like Berkshire or Brookline, the tweet will be marked as a company tweet. These specific criteria have priority over the general ones, so if the two do not coincide, the labelling of the ticker-specific criteria is used.
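The priority rule between general and ticker-specific criteria can be sketched as follows. The word lists are a small subset of Tables V and VII; the tokenization, the prefix matching of stems, the choice to check company terms before cryptocurrency terms for a given ticker, and the default label of company when nothing matches are all assumptions made for illustration.

```python
import re

# General cryptocurrency terms (first row of Table VII). They are matched as
# token prefixes because entries such as "crypt" or "binanc" look like stems.
GENERAL_CRYPTO_STEMS = (
    "coin", "crypt", "btc", "lumen", "ethereum", "bitcoin",
    "whale", "stellar", "binanc", "blockchain",
)

# A few non-homonym cryptocurrency cashtags (subset of Table V).
NON_HOMONYM_CRYPTO_CASHTAGS = {"$btc", "$eth", "$xrp", "$ltc"}

# Ticker-specific word lists (subset of Table VII).
TICKER_RULES = {
    "$nxt": {"company": ("plc",), "crypto": ("ignis", "ardor", "jelurida")},
    "$xlm": {"company": ("xlmedia",),
             "crypto": ("rocket", "moon", "$str", "worth", "now", "trx")},
    "$brk": {"company": ("berkshire", "buffet", "warren", "brookline", "brooks"),
             "crypto": ()},
}


def classify(text, ticker):
    """Label a tweet captured through `ticker` as 'company' or 'cryptocurrency'."""
    tokens = re.findall(r"\$?[a-z0-9]+", text.lower())

    def has_any(stems):
        return any(tok.startswith(stem) for tok in tokens for stem in stems)

    rules = TICKER_RULES.get(ticker.lower(), {})
    # Ticker-specific criteria take priority over the general ones.
    if has_any(rules.get("company", ())):
        return "company"
    if has_any(rules.get("crypto", ())):
        return "cryptocurrency"
    # General criteria, i.e. the behaviour of the Simple filter.
    if has_any(GENERAL_CRYPTO_STEMS) or any(
        tok in NON_HOMONYM_CRYPTO_CASHTAGS for tok in tokens
    ):
        return "cryptocurrency"
    return "company"  # assumption: keep the tweet when nothing matches
```

With these rules, a $BRK tweet mentioning both Warren Buffett and bitcoin is kept as a company tweet, whereas the general terms alone would have discarded it.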

The classification performance of this filtering system is shown in Table VI. The results of the Extended filter are significantly better than those of the Simple filter. The recall of the system has increased to 99.9% and only seventeen company tweets are misclassified, a value in line with what was sought for this type of filter. The accuracy of the system has also increased slightly thanks to the specific knowledge for each ticker. However, since this new filter uses information specific to each company, it is limited to the tickers analyzed and cannot be used in other cases of interference between companies and cryptocurrencies.

VIII SVM classifiers

Although the heuristic filters successfully detect a large number of tweets about cryptocurrencies, adapting some of the patterns seen during the descriptive analysis to these techniques is complex. Thus, supervised methods, more specifically SVMs (Support Vector Machines), have been deployed to effectively separate both types of tweets. Unlike heuristic filters, SVM-based solutions try to achieve a tradeoff between precision and recall, i.e., significant improvements in precision at the expense of incorrectly classifying some company tweets. Therefore, the fundamental measure used to evaluate these classifiers is the F-score, which allows us to compare the performance of the different classifiers deployed. FTNHDS and AMNHDS have been manually annotated to design the SVMs (the three previously mentioned subsets were used: train set, test set and tune set).

The first result that should be highlighted is the SVM classifier that uses the differentiating features observed during the comparison of both types of tweets as independent variables. Within this set of variables, the posting date has been discarded to not limit the filter to the study period. See the variables in the first SVM approach (Simple SVM Classifier) in Table VIII.

Variable | Type | Description
Ticker | Factor | Tickers of the different companies
Weekday | Integer | Day of the week when the tweet was posted (from 0 to 4)
Hour | Integer | Hour of the day when the tweet was posted
Followers | Numeric | Log10 account followers
Friends | Numeric | Log10 account friends
Favorites | Numeric | Log2 account favorites
Dollars | Numeric | Log2 number of different tickers in the tweet
DefaultProfile | Logical | True if the account has not changed the default interface
AccountCreationTime | Factor | Moment when the account was created (divided in half years)

TABLE VIII: Independent variables for Simple SVM classifier

According to the performance measurements in Table X, the precision values obtained are significantly higher than those of the heuristic classifiers, reaching values close to 90% with a very low reduction of the recall. In addition, the parameters are considered stable with respect to temporal variations, which extends the lifespan of the classifier. The only parameter with a clear temporal component is AccountCreationTime, whose use is justified to differentiate accounts created before and after the irruption of cryptocurrencies. Therefore, the performance of the classifier should remain stable in the medium term. To apply the Simple SVM to other cryptocurrencies, it should be considered that the ticker used to retrieve the tweet is one of the model parameters, so a new model with the new tickers’ parameters should be developed when a new ticker appears. Otherwise, a slight degradation in performance could occur.
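A hedged sketch of how the variables of Table VIII could be derived from a captured tweet is given below. The input field names are hypothetical, and the `+ 1` offset inside the logarithms is an assumption to handle accounts with zero followers, friends or favorites; the paper does not state how zeros are treated.

```python
import math
from datetime import datetime


def half_year_bucket(created_at):
    """Account creation time as a half-year factor level, e.g. '2017H2'."""
    return f"{created_at.year}H{1 if created_at.month <= 6 else 2}"


def simple_svm_variables(tweet):
    """Map a tweet (dict with hypothetical field names) to Table VIII variables."""
    posted = tweet["posted_at"]
    return {
        "Ticker": tweet["ticker"].upper(),
        # Python numbers weekdays 0-6 (Monday = 0); Table VIII lists 0 to 4,
        # and the treatment of weekend tweets is not described in the paper.
        "Weekday": posted.weekday(),
        "Hour": posted.hour,
        "Followers": math.log10(tweet["followers"] + 1),  # +1 is an assumption
        "Friends": math.log10(tweet["friends"] + 1),
        "Favorites": math.log2(tweet["favorites"] + 1),
        # A captured tweet contains at least one cashtag, so no offset here.
        "Dollars": math.log2(tweet["n_tickers"]),
        "DefaultProfile": bool(tweet["default_profile"]),
        "AccountCreationTime": half_year_bucket(tweet["account_created_at"]),
    }


example = simple_svm_variables({
    "ticker": "$nxt",
    "posted_at": datetime(2018, 1, 15, 14, 30),
    "followers": 99,
    "friends": 9,
    "favorites": 7,
    "n_tickers": 4,
    "default_profile": True,
    "account_created_at": datetime(2017, 8, 1),
})
```

The factor variables (Ticker, AccountCreationTime) would still need to be encoded, for example as one indicator per level, before being passed to an SVM implementation.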

To improve lifespan, a model applicable to situations different from those considered during this experiment has been deployed. Moreover, although the Simple SVM performance is satisfactory, the information in the content of the tweet is not exploited to its full potential. In an alternative SVM-based approach, referred to as the Extended SVM Classifier, relevant terms from the exploratory analysis and key terms identified during manual annotation were considered to differentiate cryptocurrency and company tweets. A specific vocabulary (Table IX) has been created from these terms and used to enrich the information through new independent variables (1 if the word is in the tweet, no matter how many times, and 0 otherwise). A slight improvement in all measurements can be appreciated in Table X, especially in accuracy and F-score, and even more remarkably in AUC, with a value practically equal to 1 even in the test set. Overall, the Extended SVM classifier provides precision and recall values greater than 93%. In terms of lifespan and applicability to scenarios different from the one in this experiment, the consideration of terms in the body of the tweet improves the useful life of the classifier without retraining it, since these words refer to cryptocurrencies and companies and not to specific temporal situations, i.e., results are considered stable in the medium term. Likewise, the terms in the vocabulary are mainly general and do not refer to specific cryptocurrencies. Even so, a few terms in the vocabulary are related to specific companies or cryptocurrencies, e.g. ardor, so there is a small chance of declining performance if the Extended SVM classifier is applied in another time period.
Finally, it is worth mentioning that both execution (classification) and especially training are slower than for the Simple SVM classifier, because the vocabulary is added as independent variables, which results in a larger number of support vectors.
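The binary vocabulary variables can be illustrated with a short sketch. The vocabulary below is a lowercase subset of Table IX, and the function name is hypothetical; the input is assumed to be the list of preprocessed terms of one tweet.

```python
# Lowercase subset of the Table IX vocabulary, for illustration only.
VOCABULARY = ["bitcoin", "signal", "join", "crypto", "fee",
              "plc", "inc", "group", "company", "finance"]


def vocabulary_indicators(terms, vocabulary=VOCABULARY):
    """One variable per vocabulary word: 1 if present in the tweet, else 0."""
    present = set(terms)  # repeated occurrences count only once
    return [1 if word in present else 0 for word in vocabulary]


indicators = vocabulary_indicators(["bitcoin", "bitcoin", "fee", "airdrop"])
```

These indicators would be appended to the numeric variables of the Simple SVM classifier, so each vocabulary word adds one input dimension.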

Binac, Bitcoin, Signal, Join, Crypto, Fee, Plc, Inc, Group, Company, Finance, Weed, Aapl, Moon, Cannabis, berkshire, Brooks, Ltc, Eth, Dash, Xrp, Xmr, Xem, Nem, Rocket, Jelurida, Ignis, Medical, Buffet, Warren, Stellar

TABLE IX: Vocabulary for the Extended SVM classifier
SVM classifiers
Basic Extended
Precision (Company) 0.898 0.941
Recall 0.897 0.935
Specificity 0.997 0.998
Accuracy 0.995 0.997
F-Score 0.898 0.938
AUC 0.977 0.997
TABLE X: SVM classifiers: Quality

IX Combined Classifiers

In view of the results seen so far, heuristic filters and SVM classifiers can be used together so that each supplements the other’s benefits. In this section, Combined Systems are introduced in which the result of the Extended word-based heuristic filter is used as a parameter for the Extended SVM classifier. The high recall of the former allows a large number of cryptocurrency tweets to be discarded quickly so that the SVM can focus on precision; the combined system is therefore expected to improve in both metrics. In fact, precision, recall and F-score reach values close to 0.97 in the test set (Table XI), and AUC also improves slightly. Therefore, almost all of the tweets are correctly classified, and only a small percentage is misclassified.

Combined Classifiers
Original Independent
Precision (Company) 0.976 0.933
Recall 0.968 0.855
Specificity 0.999 0.998
Accuracy 0.999 0.995
F-Score 0.972 0.892
AUC 0.9994 0.988
TABLE XI: Combined Classifiers: Quality

The lifespan and applicability of the Combined Systems are identical to those of the SVM classifiers above: the results should remain stable in the medium term, but the benefits will be slightly lower if the systems are applied to other coincident tickers not included here. However, execution is slower than for the SVM classifiers alone, since it requires the consecutive execution of heuristic filters and SVM classifiers. Finally, the working point of the system can be adjusted to obtain slightly higher values of precision or recall depending on the needs.
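Adjusting the working point amounts to moving the decision threshold applied to the classifier's scores. The sketch below, with hypothetical function names and toy scores, picks the highest threshold that still reaches a target recall for company tweets and reports the precision obtained there; it assumes that higher scores mean "more likely a company tweet".

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are labelled company."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold whose recall reaches the target (None if impossible)."""
    for threshold in sorted(set(scores), reverse=True):
        _, recall = precision_recall_at(scores, labels, threshold)
        if recall >= target_recall:
            return threshold
    return None


# Toy scores; True marks a company tweet.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
strict = threshold_for_recall(scores, labels, 0.6)
lenient = threshold_for_recall(scores, labels, 1.0)
```

In this toy case, demanding full recall lowers the threshold and admits one cryptocurrency tweet, which is the precision and recall tradeoff the working point controls.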

Although the results are stable in the medium term, the potential to apply the combined system in other scenarios (different from this experiment) is smaller due to the nature of some of the variables used in the Extended word-based heuristic filter. Therefore, a combined system built from the results of the Simple word-based heuristic filter (only general information) can be more stable because the basic filter is used instead of words related to specific cashtags. To produce a combined system as stable as possible, captured ticker information and terms related to specific tickers have been discarded in the Extended SVM classifier: the variables and vocabulary for the new Independent Combined System are shown in Tables XII and XIII. The generalization process yields a classifier easily applicable to scenarios of colliding company tickers and cryptocurrencies similar to the one studied in this experiment, with a similar performance in the test set (see Table XI). Although a slight fall can be observed, especially noticeable for recall, precision remains high, exceeding 90%; accuracy is greater than 99%; and AUC remains high, which allows adjusting the working point to favor precision or recall depending on the desired features. Like the Combined System, the Independent Combined System classifier does not use variables with high temporal variability, so the results should be stable in the medium term.

Variable | Type | Description
Weekday | Integer | Day of the week of posting (from 0 to 4)
Followers | Numeric | Log10 account followers
Friends | Numeric | Log10 account friends
Favorites | Numeric | Log2 account favorites
Dollars | Numeric | Log2 number of different tickers
DefaultProfile | Logical | True for accounts with the default interface
AccountCreationTime | Factor | Account creation time (divided in half years)

TABLE XII: Variables in Independent Combined Classifier

Binac, Bitcoin, Signal, Join, Crypto, Fee, Plc, Inc, Group, Company, Finance, Aapl, Moon, Ltc, Eth, Dash, Xrp, Xmr, Xem, Nem, Rocket

TABLE XIII: Vocabulary in Extended SVM Classifier for the Independent Combined Classifier

X LSTM classifiers

The heuristic filters, SVM classifiers and combined classifying systems proposed so far consider a set of terms as a relevant representation of the tweet content. However, none of them considers the relative importance of each of these terms or the relationships that may exist among them. For this reason, the aforementioned combined classifying systems have been adapted to work with LSTM (Long Short-Term Memory) networks which, as mentioned, are a type of recurrent network. Thus, instead of a set of terms, the classifier uses an embedding matrix that collects the relative importance and inter-relationship of terms. Moreover, the LSTM adaptation allows a greater number of terms in the vocabulary without an excessive increase in the number of independent variables.

FTHDS and AMHDS have been used as input to an LSTM network in order to obtain the embedding matrix. In particular, an LSTM network aims to predict the next word of a sequence by considering the previous words. For this, a matrix is generated which includes the weight of each considered term (as a measure of its relevance for the problem) and the relationships among terms. The matrix and the network are trained iteratively, so a suitable number of terms should be defined to guarantee an affordable computational complexity. In our experiment, to obtain an LSTM-based classifier, tweets are represented as vectors and the vocabulary consists of the 10,000 most common terms within the homonym tweets. Before LSTM training, a pre-processing step is applied, which includes the following tasks: (1) removing unusual characters, punctuation, emoticons, URLs and stop words; (2) discarding tickers, names of cashtags and extremely common terms; and (3) stemming. After this preprocessing step, the resulting terms have a higher representative capacity and the final vocabulary consists of 9,998 terms, plus one entry for terms outside the vocabulary and another for the line break. After training the LSTM network, new independent variables are generated by multiplying the tweets’ vectors by the LSTM matrix. As a result, in addition to a significant reduction in the number of variables (from 10,000 to 200), a more relevant (in terms of the problem) representation of the tweet body is obtained. To sum up, for the LSTM approach in our experiment, a tweet is represented by a vector of 200 variables, which are the input independent variables of the SVM classifier.
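The projection step, in which a vocabulary-sized tweet vector is multiplied by the embedding matrix, can be sketched with toy sizes (4 vocabulary entries and 2 dimensions instead of 10,000 and 200). The embedding values are made up rather than learned, the function names are hypothetical, and representing the tweet as term counts is an assumption, since the paper does not specify whether counts or binary indicators are used.

```python
def tweet_vector(terms, term_to_id, vocab_size, unknown_id=0):
    """Vocabulary-sized count vector; unseen terms share the 'unknown' entry."""
    counts = [0] * vocab_size
    for term in terms:
        counts[term_to_id.get(term, unknown_id)] += 1
    return counts


def project(counts, embedding):
    """Multiply a 1 x V tweet vector by a V x D embedding matrix."""
    dim = len(embedding[0])
    return [
        sum(counts[i] * embedding[i][d] for i in range(len(counts)))
        for d in range(dim)
    ]


# Toy vocabulary: index 0 is reserved for out-of-vocabulary terms.
TERM_TO_ID = {"crypto": 1, "plc": 2, "moon": 3}
# Made-up 4 x 2 embedding matrix; in the paper it is learned by the LSTM.
EMBEDDING = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, -0.5],
]

counts = tweet_vector(["crypto", "moon", "moon", "lambo"], TERM_TO_ID, 4)
dense = project(counts, EMBEDDING)
```

The product is simply the sum of the embedding rows of the terms present in the tweet, which is why the number of input variables depends on the embedding dimension and not on the vocabulary size.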

LSTM-SVM Classifiers
Original Independent
Precision (Company) 0.981 0.967
Recall 0.969 0.928
Specificity 0.9995 0.999
Accuracy 0.999 0.997
F-Score 0.975 0.947
AUC 0.9990 0.992
TABLE XIV: LSTM-adapted SVM: Quality

The Combined Classifiers in Section IX are deployed again, but using the results of the embedding matrix instead of the list of common terms (basic or extended). As shown in Table XIV, there are no major changes in performance. Since the extra computational load of the LSTM-SVM classifier provides only a slight improvement over the previous classifiers, it is not worth considering as an isolated solution; however, LSTM can produce performance improvements in the combined classifiers. On the other hand, the limitations and applicability of the LSTM-adapted SVM classifier are the same as those of the SVM classifiers. The LSTM-adapted SVM classifier may show a small drop in performance when used in scenarios with companies different from those considered in this experiment, but it maintains its performance in the medium term for the same company scenario, since variables with a fairly clear temporal variation are not in the model.

Unlike the LSTM-adapted SVM classifier, the Independent LSTM-adapted SVM classifier provides large improvements compared to the same model with key terms. These improvements are especially noticeable for the precision, recall and F-measure of the system, all of which surpass 0.92, unlike the 0.855 recall of the previous independent classifier. The significantly larger vocabulary increases the representative capacity and compensates for the reduction in the other fields. In fact, the benefits obtained are similar to those of the LSTM-adapted SVM classifier, which correctly classifies virtually all tweets. For this reason, the use of this model would be advisable to process homonym tickers different from those studied, although with a computational overhead due to the large size of the vocabulary, embedding matrix and support vectors. As with the previous classifiers, it does not use variables with high temporal variability, so the results obtained should be maintained in the medium term. To maintain long-term benefits, it would be necessary to update the embedding matrix every few months to adapt to changes in the terms used. However, given the large size of the vocabulary, most terms should not change, so the performance of the system should degrade more slowly than that of the independent classifier.

XI Logistic-regression-based Classifiers

Despite the good results obtained with the previous SVM classifiers, the execution and especially the training of SVMs can be slow and, more importantly, SVMs are grey-box models, so their results are hard to interpret in order to guide further improvements. As a white-box alternative, a set of logistic-regression-based classifiers was considered, which, in addition, are simpler and therefore faster. If results of similar quality can be obtained with a simpler model, the extra complexity may be unjustified. In our experiment, SVM was replaced by logistic regression but, given the high computational cost of the LSTM network, the latter has not been used in the logistic-regression-based models, since the aim is to obtain a simpler, white-box solution. The quality results of these classifiers are shown in Table XV.
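To illustrate why logistic regression is regarded as a white-box model, the following sketch fits one by plain batch gradient descent on made-up data with two variables (a log-scaled follower count and a "crypto" term indicator). It is a hand-rolled illustration, not the implementation used in the experiment; the learning rate, number of epochs and data are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability that a tweet with variables x is a company tweet."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def fit_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(x, weights, bias) - y
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up rows: [log10 followers, contains "crypto"]; label 1 = company tweet.
ROWS = [[3.0, 0], [2.5, 0], [2.8, 0], [0.5, 1], [1.0, 1], [0.3, 1]]
LABELS = [1, 1, 1, 0, 0, 0]
weights, bias = fit_logistic_regression(ROWS, LABELS)
```

The sign and size of each fitted coefficient can be read directly: here a positive weight on followers and a negative weight on the "crypto" indicator, which is the kind of inspection that an SVM with many support vectors does not offer as easily.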

LR Classifiers
Basic Ext. Comb. Ind.
Precision (Company) 0.816 0.914 0.950 0.871
Recall 0.807 0.872 0.960 0.801
Specificity 0.995 0.998 0.999 0.997
Accuracy 0.990 0.994 0.998 0.991
F-Score 0.812 0.892 0.955 0.835
AUC 0.977 0.993 0.9997 0.986
TABLE XV: Logistic-regression-based: Quality (Basic, Extended, Combined, Independent)

Although the quality of all the logistic regression models is lower than that of the SVM-based classifiers, the fall is not very significant, and the execution time of these logistic-regression-based classifiers is significantly lower (at least five times faster). The basic logistic regression classifier is especially noteworthy: since the tweet content does not need to be processed, it can be trained and applied very quickly. Regarding limitations, logistic-regression-based classifiers are subject to constraints similar to those of the SVM-based classifiers because the same independent variables are used.

XII Limitations

In this experiment, different alternatives have been explored to disambiguate homonym terms in the LSE-100 and in the cryptocurrency market, with the final aim of clearly distinguishing tweets referring to companies in regulated markets (LSE-100 in the experiment) from tweets about cryptocurrencies, which belong to an unregulated market. First, word-based heuristic filters have the main benefit of discarding a large number of cryptocurrency tweets while misclassifying practically no company tweets, so they achieve high recall values with acceptable levels of precision. Secondly, classifiers based on supervised methods provide a tradeoff between precision and recall, maximizing the F-score quality measure. For both heuristic filters and supervised classifiers, different alternatives have been explored and analyzed in terms of quality measures for binary classification and in terms of computational load. High-quality results have been obtained for the more complex and computationally expensive models.

In a second step, building on the complementary benefits of heuristic filters and SVMs, we have explored the combined deployment of both types of classifying models. As a result, we obtain a Combined System with AUC values very close to 1 and an F-score above 0.975. Thirdly, we have also studied classifiers that identify company and cryptocurrency tweets without using information related to any of the studied cryptocurrencies. These Independent Models, despite a small decrease in classification quality, still maintain high levels of precision and recall, especially if they use an LSTM embedding matrix instead of a fixed list of key terms. These cryptocurrency-independent models offer the potential to be used in scenarios different from the experiment in this paper. Also, their operating point on the ROC curve can be adjusted to suit a specific problem. Finally, logistic regression has been applied in the experiment as a less computationally expensive and more interpretable solution; in particular, the Extended logistic regression classifier allows a quick initial classification scheme to be deployed.

Regarding the limitations of the developed classifier systems, two cases should be differentiated. In the first place, there are the classifiers that use information referring to some of the tickers considered in the experiment, such as the Extended Word-based Heuristic Filter, the Extended SVM Combined System, or the LSTM-adapted SVM Classifier. They use input variables such as company tickers or key terms related to some of the cashtags in the experiment to improve classification quality. This means that their performance for company tickers not in the experiment may be lower. Thus, they are especially suitable for working with the companies of the LSE-100, but their benefits diminish outside this stock index.

In the second place, there are the classifying systems that do not use information regarding the cashtags in the experiment (independent classifiers), such as the LSTM independent classifier or the simple word-based Heuristic Filter, which maintain their performance for tickers outside the experiment. For these classifiers, performance is stable in the medium term, since no time-dependent information is considered apart from the account creation time, and creation time is only included to differentiate accounts created before and after the popularization of cryptocurrencies. Thus, they should continue to work properly. However, two factors do change over time: firstly, popular cryptocurrencies may vary, so the list of cryptocurrency tickers should be updated every few months to maintain the classifying performance of the system; and secondly, common terms in tweet content may also vary. To sum up, the embedding matrix should be re-computed every few months to keep the classifying performance of the system. The other independent variables considered should maintain regular behavior, at least in the medium term.

XIII Model Evaluation and Selection

Given that the Independent Classifiers are considered superior in terms of generalization to other markets, in this section the three independent models are further evaluated to check whether there is a statistically significant difference among them: the Independent SVM Combined Classifier (SVM-Ind), the Independent LSTM-adapted SVM Classifier (LSTM-Ind) and the Independent Logistic-Regression-based Classifier (LR-Ind). Two non-parametric statistical tests are applied to SVM-Ind, LSTM-Ind and LR-Ind [Raschka(2018)]: (1) McNemar’s test, which compares the performance of two machine learning classifiers in a paired comparison; and (2) Cochran’s Q test, a generalized version of McNemar’s test that can be applied to compare three or more classifiers.

Cochran’s Q test is a non-parametric statistical test of the null hypothesis that the classifiers do not differ in classification accuracy. If the test result suggests that there is insufficient evidence to reject the null hypothesis, then any difference observed in the performance of the models is probably due to statistical chance. Conversely, if the test rejects the null hypothesis, it is likely that the different performances are due to a difference in the models.

Let $\{C_{1},\dots,C_{M}\}$ be a set of classifiers that have all been tested on the same dataset.

If the $M$ classifiers do not perform differently, then the following $Q$ statistic is distributed approximately as chi-squared with $M-1$ degrees of freedom:

$$Q = (M-1)\,\frac{M\sum_{i=1}^{M} G_{i}^{2} - T^{2}}{MT - \sum_{j=1}^{N_{ts}} (M_{j})^{2}}$$

Here, $G_{i}$ is the number of objects out of $N_{ts}$ correctly classified by $C_{i}$, $i = 1,\dots,M$; $M_{j}$ is the number of classifiers out of $M$ that correctly classified object $\mathbf{z}_{j} \in \mathbf{Z}_{ts}$, where $\mathbf{Z}_{ts} = \{\mathbf{z}_{1},\dots,\mathbf{z}_{N_{ts}}\}$ is the test dataset on which the classifiers are tested; and $T$ is the total number of correct votes among the $M$ classifiers:

$$T = \sum_{i=1}^{M} G_{i} = \sum_{j=1}^{N_{ts}} M_{j}$$
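The definition above can be turned into a short computation. The following is a minimal sketch, not the authors' implementation: it assumes the predictions are available as a 0/1 correctness matrix (one row per test object, one column per classifier), and the names `cochran_q`, `chi2_sf` and `toy` are illustrative. The chi-squared upper tail is evaluated with the standard closed-form series for integer degrees of freedom so that only the Python standard library is needed.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + sqrt(2/pi) * exp(-x/2) * sum_{r=1}^{(df-1)/2} x^(r-1/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def cochran_q(correct):
    """correct[j][i] is 1 if classifier i labelled test object j correctly, else 0.
    Returns the Q statistic and its p-value with M - 1 degrees of freedom."""
    m = len(correct[0])                                      # M classifiers
    g = [sum(row[i] for row in correct) for i in range(m)]   # G_i
    m_j = [sum(row) for row in correct]                      # M_j
    t = sum(g)                                               # T
    denominator = m * t - sum(v * v for v in m_j)
    if denominator == 0:
        raise ValueError("Q is undefined: all classifiers agree on every object")
    q = (m - 1) * (m * sum(v * v for v in g) - t * t) / denominator
    return q, chi2_sf(q, m - 1)


# Toy correctness matrix: 6 test objects, 3 classifiers (made-up data).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 1, 0]]

if __name__ == "__main__":
    q, p = cochran_q(toy)
    print(f"Q = {q:.3f}, p-value = {p:.4f}")
```

For the toy matrix, $G = (5, 5, 2)$, $T = 12$ and $\sum_j M_j^2 = 26$, so $Q = 2 \cdot (3 \cdot 54 - 144)/(36 - 26) = 3.6$. With three classifiers the statistic has two degrees of freedom, for which the tail probability reduces to $e^{-Q/2}$.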

Cochran’s Q test can be considered a generalized version of McNemar’s test, which compares the predictions of two models against each other to evaluate the null hypothesis:

$$\chi^{2} = \frac{(B-C)^{2}}{B+C}$$

where $B$ and $C$ are the numbers of predictions on which the two models differ: $B$ counts the samples that the first model classified correctly and the second incorrectly, and $C$ counts the opposite case.
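A minimal sketch of this pairwise test follows, assuming each model's output is summarized as a list of 0/1 correctness flags over the same test objects. The function name `mcnemar` and the toy data are illustrative. The statistic is computed exactly as in the formula above, without a continuity correction, and the p-value uses the fact that the upper tail of a chi-squared variable with one degree of freedom equals $\mathrm{erfc}(\sqrt{\chi^2/2})$.

```python
import math


def mcnemar(correct_a, correct_b):
    """correct_a[j] / correct_b[j] is 1 if model A / model B classified object j correctly.
    Returns (B, C, chi-squared statistic, p-value with 1 degree of freedom)."""
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if b_ok and not a_ok)
    if b + c == 0:
        raise ValueError("the two models never disagree; the statistic is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-squared upper tail, df = 1
    return b, c, chi2, p_value


# Made-up example: the models disagree on 20 of 100 objects (B = 15, C = 5).
model_a = [1] * 15 + [0] * 5 + [1] * 80
model_b = [0] * 15 + [1] * 5 + [1] * 80

if __name__ == "__main__":
    b, c, chi2, p = mcnemar(model_a, model_b)
    print(f"B = {b}, C = {c}, chi2 = {chi2:.3f}, p-value = {p:.4f}")
```

Only the discordant pairs enter the statistic; the 80 objects that both models classify correctly have no influence on the result.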

In this study, both tests are applied to the three Independent Models. The resulting $Q$ and $\chi^{2}$ values are shown in Table XVI.

Test                            $Q$ | $\chi^{2}$   $p$-value
Cochran’s Q                     678.135            5.56e-148
McNemar’s SVM-Ind vs LR-Ind     317.071            6.29e-71
McNemar’s SVM-Ind vs LSTM-Ind   109.835            1.07e-25
McNemar’s LSTM-Ind vs LR-Ind    489.507            1.82e-108
TABLE XVI: Cochran’s Q and McNemar’s test (all dataset)

To avoid the effect of a test set that is too large [Figueiredo(2013), Lin et al.(2013)Lin, Lucas Jr and Shmueli], we divide the data into random subsets of 10,000 samples and average the values obtained over the subsets. The resulting $Q$ and $\chi^{2}$ values are shown in Table XVII.

Test                            $Q$ | $\chi^{2}$   $p$-value
Cochran’s Q                     35.239             5.99e-05
McNemar’s SVM-Ind vs LR-Ind     15.528             0.00333
McNemar’s SVM-Ind vs LSTM-Ind   6.991              0.0744
McNemar’s LSTM-Ind vs LR-Ind    25.184             7.82e-05
TABLE XVII: Cochran’s Q and McNemar’s test
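The subset-and-average procedure can be sketched as follows for the pairwise McNemar case. Several details are assumptions rather than facts from the text: the subsets are taken to be disjoint, a final incomplete subset is dropped, subsets on which the two models never disagree are skipped, and both the statistic and the p-value are averaged per subset (the text does not say whether the p-values were averaged or recomputed from the averaged statistic). All function names are illustrative.

```python
import math
import random
from statistics import mean


def split_into_subsets(n_items, subset_size, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into disjoint subsets of
    equal size; a final incomplete subset is dropped (assumption of this sketch)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[s:s + subset_size]
            for s in range(0, n_items - subset_size + 1, subset_size)]


def mcnemar_on_subset(correct_a, correct_b, subset):
    """McNemar statistic and p-value restricted to the given indices,
    or None when the two models never disagree on that subset."""
    b = sum(1 for j in subset if correct_a[j] and not correct_b[j])
    c = sum(1 for j in subset if correct_b[j] and not correct_a[j])
    if b + c == 0:
        return None
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def averaged_mcnemar(correct_a, correct_b, subset_size=10_000, seed=0):
    """Average the statistic and the p-value over random subsets.
    Returns (mean chi-squared, mean p-value, number of subsets used)."""
    results = [mcnemar_on_subset(correct_a, correct_b, subset)
               for subset in split_into_subsets(len(correct_a), subset_size, seed)]
    results = [r for r in results if r is not None]
    if not results:
        raise ValueError("no subset with disagreeing predictions")
    return mean(r[0] for r in results), mean(r[1] for r in results), len(results)
```

Because the test statistic grows with the number of discordant pairs, evaluating it on fixed-size subsets keeps the reported significance from being driven by sample size alone, which is the concern raised by the cited works.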

In view of the result of Cochran’s Q test, assuming a significance level of $\alpha = 0.05$, we can reject the null hypothesis since the corresponding p-value is lower than $\alpha$. McNemar’s test gives further information when the models are compared pairwise: we can reject the null hypothesis for SVM-Ind versus LR-Ind and for LSTM-Ind versus LR-Ind, but not for SVM-Ind versus LSTM-Ind. Consequently, we can conclude that the improvement in performance obtained with the Independent LSTM-adapted SVM classifier in our study may be due to statistical chance, so it does not justify the greater complexity and computational cost it requires.
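The decisions described above follow mechanically from the p-values in Table XVII. The snippet below simply applies the threshold $\alpha = 0.05$ to the tabulated values; the dictionary keys and function name are illustrative labels.

```python
ALPHA = 0.05

# p-values as reported in Table XVII (averages over subsets of 10,000 samples).
table_xvii_p_values = {
    "Cochran's Q (three models)": 5.99e-05,
    "McNemar SVM-Ind vs LR-Ind": 0.00333,
    "McNemar SVM-Ind vs LSTM-Ind": 0.0744,
    "McNemar LSTM-Ind vs LR-Ind": 7.82e-05,
}


def reject_null(p_values, alpha=ALPHA):
    """Map each test to True when its null hypothesis is rejected at level alpha."""
    return {name: p < alpha for name, p in p_values.items()}


if __name__ == "__main__":
    for name, rejected in reject_null(table_xvii_p_values).items():
        print(f"{name:30s} reject H0: {rejected}")
```

Only the SVM-Ind versus LSTM-Ind comparison fails to reach significance, which is the basis for preferring the simpler SVM-Ind model over LSTM-Ind.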

XIV Conclusions and future lines

Although cashtags are the main mechanism for tracking financial information on Twitter, the irruption of cryptocurrencies has degraded the quality of the information obtained through cashtag tracking, because some cryptocurrency acronyms collide with company tickers (homonym tickers) in regulated markets. When a cashtag is used on Twitter, homonym acronyms can return results referring to stock companies and to cryptocurrencies indistinctly. In addition, most cryptocurrency tweets are automatically generated spam messages, which multiply the negative effect of the homonym acronyms and greatly degrade the informative capacity that the cashtag seeks to provide. Thus, new disambiguation mechanisms, or the adaptation of existing ones, are necessary to solve the problem and restore the original informative capacity of the cashtag tracking mechanism.

While most current research focuses on the potential of Twitter as a predictive tool for decision making in financial markets and on the development of expert systems using such information, this paper pursues a completely different objective: to illustrate the negative impact of the increasing popularity of cryptocurrencies on the cashtag mechanism through an experiment on LSE-100 companies. The aim is to deploy classifying systems capable of differentiating company tweets from cryptocurrency tweets.

Based on the distinctive features of company and cryptocurrency tweets, different classifying systems have been introduced to distinguish cryptocurrency tweets from company tweets in the case of homonym acronyms. Word-based Heuristic Filters aim to discard a large number of cryptocurrency tweets while misclassifying practically no company tweets, so they achieve high recall values with acceptable levels of precision. On the other hand, classifiers based on supervised methods provide a tradeoff between precision and recall by maximizing the F-score, and high-quality results have been obtained for the more complex and computationally expensive models. In addition, we have analyzed the combined action of heuristics and supervised models, which results in an approach reaching AUC values very close to 1 and an F-score above 0.97.

Regarding applicability and the ability to adapt to other scenarios, classifiers that do not use information related to any of the studied cryptocurrencies have also been studied. These models, despite a slight decrease in classification metrics, still preserve a high level of precision and recall, especially when they use a dynamic LSTM matrix rather than a fixed list of key terms. This good performance opens the possibility of using these solutions in scenarios different from the one studied in this experiment. Finally, the operating point of the model on the ROC curve can be adjusted according to the needs of the problem, and a simple solution based on a logistic regression classifier can be used as an initial classifier to obtain a quick estimate of the classification.

In this paper, the influence of cryptocurrency tweets on cashtag results is analyzed for the main LSE-100 companies. Although during the study period, from July 1 2017 to February 15 2018, the interference between cryptocurrencies and LSE-100 tickers only appeared for the cashtags indicated in our experiment, the negative impact of the interference has grown recently, e.g. $SPH (Sinclair Pharma (AIM-100) vs Sphere (coin)), $REDD (Redde (AIM-100) vs Reddcoin) and $SMT (Scottish Mortgage Investment Trust plc (FTSE-100) vs SmartMesh (coin)). All these new cryptocurrencies increased their popularity highly before the study period. On the other hand, homonym tickers between cryptocurrencies and stock companies are also found in other markets such as the NSQE or NASDAQ. As the Independent Models are the most detached from the training data, they have the greatest potential to be used in different stock markets, so they can be transferred to other contexts while retaining their performance. Nonetheless, our future work will also test the applicability of the independent classifiers in up-to-date LSE-100 scenarios and in other regulated stock markets, to check their performance in the presence of other colliding cashtags. This applicability testing will also seek to measure the cost of adapting the non-independent classifiers to these new cases.

References

  • [Al Nasseri et al.(2014)Al Nasseri, Tucker and de Cesare] Al Nasseri, A., Tucker, A., de Cesare, S., 2014. Big data analysis of stocktwits to predict sentiments in the stock market, in: International Conference on Discovery Science, Springer. pp. 13–24.
  • [Bordino et al.(2012)Bordino, Battiston, Caldarelli, Cristelli, Ukkonen and Weber] Bordino, I., Battiston, S., Caldarelli, G., Cristelli, M., Ukkonen, A., Weber, I., 2012. Web search queries can predict stock market volumes. PLOS ONE 7, 1–17. URL: https://doi.org/10.1371/journal.pone.0040014, doi:10.1371/journal.pone.0040014.
  • [Cavalcante et al.(2016)Cavalcante, Brasileiro, Souza, Nobrega and Oliveira] Cavalcante, R.C., Brasileiro, R.C., Souza, V.L., Nobrega, J.P., Oliveira, A.L., 2016. Computational intelligence and financial markets: A survey and future directions. Expert Systems with Applications 55, 194–211. URL: http://www.sciencedirect.com/science/article/pii/S095741741630029X, doi:10.1016/j.eswa.2016.02.006.
  • [Cazzoli et al.(2016)Cazzoli, Sharma, Treccani and Lillo] Cazzoli, L., Sharma, R., Treccani, M., Lillo, F., 2016. A large scale study to understand the relation between twitter and financial market, in: 2016 Third European Network Intelligence Conference (ENIC), pp. 98–105.
  • [Ceccarelli et al.(2016)Ceccarelli, Nidito and Osborne] Ceccarelli, D., Nidito, F., Osborne, M., 2016. Ranking financial tweets, in: ACM (Ed.), Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR ’16), pp. 527–528.
  • [Cortez et al.(2016)Cortez, Oliveira and Ferreira] Cortez, P., Oliveira, N., Ferreira, J.P., 2016. Measuring user influence in financial microblogs: experiments using stocktwits data, in: ACM (Ed.), WIMS’16 Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics.
  • [Dickinson and Hu(2015)] Dickinson, B., Hu, W., 2015. Sentiment analysis of investor opinions on twitter. Social Networking 4, 62.
  • [Ding et al.(2015)Ding, Zhang, Liu and Duan] Ding, X., Zhang, Y., Liu, T., Duan, J., 2015. Deep learning for event-driven stock prediction, in: Twenty-Fourth International Joint Conference on Artificial Intelligence.
  • [Dredze et al.(2016)Dredze, Kambadur, Kazantsev, Mann and Osborne] Dredze, M., Kambadur, P., Kazantsev, G., Mann, G., Osborne, M., 2016. How twitter is changing the nature of financial news discovery, in: ACM (Ed.), Proceedings of the Second International Workshop on Data Science for Macro-Modeling.
  • [Elliott et al.(2017)Elliott, Grant and Rennekamp] Elliott, W.B., Grant, S.M., Rennekamp, K.M., 2017. How disclosure features of corporate social responsibility reports interact with investor numeracy to influence investor judgments. Contemporary Accounting Research 34, 1596–1621.
  • [Fernández Vilas et al.(2019)Fernández Vilas, Díaz Redondo, Crockett, Owda and Evans] Fernández Vilas, A., Díaz Redondo, R.P., Crockett, K., Owda, M., Evans, L., 2019. Twitter permeability to financial events: an experiment towards a model for sensing irregularities. Multimedia Tools and Applications 78, 9217–9245.
  • [Figueiredo(2013)] Figueiredo, D., 2013. When is statistical significance not significant? Brazilian Political Science Review (Online) 7.
  • [Gunduz and Cataltepe(2015)] Gunduz, H., Cataltepe, Z., 2015. Borsa istanbul (bist) daily prediction using financial news and balanced feature selection. Expert Systems with Applications 42, 9001–9011. URL: http://www.sciencedirect.com/science/article/pii/S0957417415005187, doi:10.1016/j.eswa.2015.07.058.
  • [Hentschel and Alonso(2014)] Hentschel, M., Alonso, O., 2014. Follow the money: A study of cashtags on twitter. First Monday 19.
  • [Li et al.(2017)Li, Dai, Park and Park] Li, G., Dai, J.S., Park, E.M., Park, S.T., 2017. A study on the service and trend of fintech security based on text-mining: focused on the data of korean online news. Journal of Computer Virology and Hacking Techniques 13, 249–255. URL: https://doi.org/10.1007/s11416-016-0288-9, doi:10.1007/s11416-016-0288-9.
  • [Liew and Budavári(2016)] Liew, J.K.S., Budavári, T., 2016. Do tweet sentiments still predict the stock market? SSRN Electronic Journal 2820269.
  • [Lin et al.(2013)Lin, Lucas Jr and Shmueli] Lin, M., Lucas Jr, H.C., Shmueli, G., 2013. Research commentary—too big to fail: large samples and the p-value problem. Information Systems Research 24, 906–917.
  • [Liu et al.(2015)Liu, Wu, Li and Li] Liu, L., Wu, J., Li, P., Li, Q., 2015. A social-media-based approach to predicting stock comovement. Expert Systems with Applications 42.
  • [Nassirtoussi et al.(2011)Nassirtoussi, Aghabozorgi, Wah and Ngo] Nassirtoussi, A.K., Aghabozorgi, S., Wah, T.Y., Ngo, D.C.L., 2011. Twitter mood predicts the stock market. Journal of Computational Science 2, 1–8. URL: http://www.sciencedirect.com/science/article/pii/S187775031100007X, doi:10.1016/j.jocs.2010.12.007.
  • [Nassirtoussi et al.(2015)Nassirtoussi, Aghabozorgi, Wah and Ngo] Nassirtoussi, A.K., Aghabozorgi, S., Wah, T.Y., Ngo, D.C.L., 2015. Text mining of news-headlines for forex market prediction: A multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications 42, 306–324. URL: http://www.sciencedirect.com/science/article/pii/S0957417414004801, doi:10.1016/j.eswa.2014.08.004.
  • [Nguyen et al.(2015)Nguyen, Shirai and Velcin] Nguyen, T.H., Shirai, K., Velcin, J., 2015. Sentiment analysis on social media for stock movement prediction. Expert Systems with Applications 42, 9603–9611.
  • [Oliveira et al.(2016)Oliveira, Cortez and Areal] Oliveira, N., Cortez, P., Areal, N., 2016. Stock market sentiment lexicon acquisition using microblogging data and statistical measures. Decision Support Systems 85, 62–73. URL: http://www.sciencedirect.com/science/article/pii/S0167923616300240, doi:10.1016/j.dss.2016.02.013.
  • [Oliveira et al.(2017)Oliveira, Cortez and Areal] Oliveira, N., Cortez, P., Areal, N., 2017. The impact of microblogging data for stock market prediction: Using twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Systems with Applications 73, 125–144. URL: http://www.sciencedirect.com/science/article/pii/S0957417416307187, doi:10.1016/j.eswa.2016.12.036.
  • [Pagolu et al.(2016)Pagolu, Reddy, Panda and Majhi] Pagolu, V.S., Reddy, K.N., Panda, G., Majhi, B., 2016. Sentiment analysis of twitter data for predicting stock market movements, in: 2016 international conference on signal processing, communication, power and embedded system (SCOPES), IEEE. pp. 1345–1350.
  • [Pai and Liu(2018)] Pai, P., Liu, C., 2018. Predicting vehicle sales by sentiment analysis of twitter data and stock market values. IEEE Access 6, 57655–57662.
  • [Rajesh and Gandy(2016)] Rajesh, N., Gandy, L., 2016. Cashtagnn: Using sentiment of tweets with cashtags to predict stock market prices, in: 11th International Conference on Intelligent Systems: Theories and Applications (SITA), IEEE.
  • [Ranco et al.(2015)Ranco, Aleksovski, Caldarelli, Grčar and Mozetič] Ranco, G., Aleksovski, D., Caldarelli, G., Grčar, M., Mozetič, I., 2015. The effects of twitter sentiment on stock price returns. PloS one 10.
  • [Ranco et al.(2016)Ranco, Bordino, Bormetti, Caldarelli, Lillo and Treccani] Ranco, G., Bordino, I., Bormetti, G., Caldarelli, G., Lillo, F., Treccani, M., 2016. Coupling news sentiment with web browsing data improves prediction of intra-day price dynamics. PLOS ONE 11, 1–14. URL: https://doi.org/10.1371/journal.pone.0146576.
  • [Rao and Srivastava(2014)] Rao, T., Srivastava, S., 2014. Twitter Sentiment Analysis: How to Hedge Your Bets in the Stock Markets. Springer International Publishing, Cham. pp. 227–247.
  • [Raschka(2018)] Raschka, S., 2018. Model evaluation, model selection, and algorithm selection in machine learning. CoRR abs/1811.12808. URL: http://arxiv.org/abs/1811.12808, arXiv:1811.12808.
  • [Ruiz et al.(2012)Ruiz, Hristidis, Castillo, Gionis and Jaimes] Ruiz, E.J., Hristidis, V., Castillo, C., Gionis, A., Jaimes, A., 2012. Correlating financial time series with micro-blogging activity, in: Proceedings of the fifth ACM international conference on Web search and data mining, ACM. pp. 513–522.
  • [Shutes et al.(2016)Shutes, McGrath, Lis and Riegler] Shutes, K., McGrath, K., Lis, P., Riegler, R., 2016. Twitter and the us stock market: The influence of micro-bloggers on share prices. Economics and Business Review 2.
  • [Sprenger et al.(2014)Sprenger, Tumasjan, Sandner and Welpe] Sprenger, T.O., Tumasjan, A., Sandner, P.G., Welpe, I.M., 2014. Tweets and trades: the information content of stock microblogs. Eur Financial Management 20, 926–957.
  • [Wang et al.(2012)Wang, Huang and Wang] Wang, B., Huang, H., Wang, X., 2012. A novel text mining approach to financial time series forecasting. Neurocomputing 83, 136–145. URL: http://www.sciencedirect.com/science/article/pii/S0925231211007302, doi:10.1016/j.neucom.2011.12.013.
  • [Yan Li et al.(2006)Yan Li, Shiu and Pal] Yan Li, Shiu, S.C.K., Pal, S.K., 2006. Combining feature reduction and case selection in building cbr classifiers. IEEE Transactions on Knowledge and Data Engineering 18, 415–429.
  • [Zhang(2013)] Zhang, L., 2013. Sentiment analysis on Twitter with stock price and significant keyword correlation. Ph.D. thesis.
  • [Zhang et al.(2018)Zhang, Qu, Huang, Fang and Yu] Zhang, X., Qu, S., Huang, J., Fang, B., Yu, P., 2018. Stock market prediction via multi-source multiple instance learning. IEEE Access 6, 50720–50728.