Keyword extraction results vs YAKE #25

Closed
db1981 opened this issue Feb 17, 2021 · 10 comments

@db1981

db1981 commented Feb 17, 2021

Hi Maarten,
I was super excited when I found out about your project because I wasn't happy with the results of the "static" algorithms (TF-IDF, RAKE, etc) and I thought that adding the Transformers twist could have been a game changer.
I just ran KeyBERT on a bunch of texts and unfortunately the results are far from what I expected... I wanted to understand if I'm missing something in the configuration. I ran a comparison against YAKE, which I believe delivers a good selection. I highlighted what I believe are the right keywords among those extracted.

abb_brief.txt: Six steps to predict...

KeyBert
N-gram 1
['powerful', 'crucial', 'heuristic', 'heuristics', 'holistic']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['engineers', 'training', 'costly', 'decades', 'holistic']

N-gram 1 | mmr True | diversity 0.7
['powerful', 'environmental', 'oil', 'decades', 'conducting']

N-gram 1 | mmr True | diversity 0.2
['powerful', 'decades', 'engineers', 'holistic', 'oil']

Yake
[ ('maintenance', 0.009284128714982649),
('predictive maintenance', 0.014353824666396151),
('asset', 0.017838983298883827),
('equipment', 0.01827858388518511),
('assets', 0.01921121278341335),
('data', 0.02202428913616752),
('performance', 0.02562450472920611),
('system', 0.030684635541426298),
('predictive', 0.03602109202391525),
('predictive maintenance strategy', 0.04021698272687705),
('preventative maintenance', 0.04097700545772903),
('key', 0.045187052701102695),
('asset performance', 0.04537221430023765),
('plant', 0.045993250652619805),
('step', 0.04701900159897897),
('asset health', 0.05129570336498491),
('maintenance strategy', 0.05190103991418995),
('asset performance management', 0.053141427401525),
('asset management system', 0.054003690529928906),
('systems', 0.056255165159281556)]

honeywell_brief.txt: Honeywell Brings Ene...

KeyBert
N-gram 1
['oil', 'norway', 'norwegian', 'oslo', 'offshore']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['train', '70mw', 'environmental', 'offshore', 'norway']

N-gram 1 | mmr True | diversity 0.7
['oil', 'norway', 'accounting', 'daily', 'environmental']

N-gram 1 | mmr True | diversity 0.2
['oil', 'norway', 'offshore', 'compressors', 'environmental']

Yake
[ ('edvard grieg', 0.0047243731441952855),
('honeywell brings energy', 0.005048731021543624),
('lundin norway edvard', 0.00774357023651215),
('norway edvard grieg', 0.007859681432647146),
('lundin', 0.01097489749993232),
('honeywell', 0.011659075482465663),
('edvard grieg platform', 0.011747470639833146),
('honeywell forge', 0.012242825657779946),
('lundin norway creates', 0.012446534894446847),
('honeywell brings', 0.012646039494275783),
('asset performance management', 0.013202226795784838),
('brings energy accounting', 0.01421896527633952),
('enterprise performance management', 0.014355323112609336),
('lundin norway', 0.014378904821544412),
('performance management', 0.01779551307545978),
('honeywell forge asset', 0.017857115752163682),
('asset performance', 0.02052810653424872),
('north sea', 0.02116890341510034),
('edvard grieg serves', 0.022453059681042196),
('performance management software', 0.022799786193990212)]

ibm_brief.txt: Essential intelligen...

KeyBert
N-gram 1
['optimizing', 'improving', 'workflow', 'adaptability', 'workflows']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['expanded', 'analytics', 'lifecycle', 'efficiency', 'workflow']

N-gram 1 | mmr True | diversity 0.7
['optimizing', 'ibm', 'lake', 'global', 'lifecycle']

N-gram 1 | mmr True | diversity 0.2
['optimizing', 'workflow', 'improving', 'lifecycle', 'adaptability']

Yake
[ ('asset', 0.019425671610029897),
('maximo', 0.03385207707983861),
('maintenance', 0.043681051375471604),
('data', 0.04682441705459914),
('asset management', 0.0489307794359093),
('eam', 0.06313545614404494),
('management', 0.06687517191726874),
('operational', 0.06982561607664019),
('assets', 0.06993241779610762),
('costs', 0.07471213680215111),
('reduce', 0.07615429287108003),
('essential intelligence', 0.0768640289294993),
('reliable asset management', 0.07907749923254684),
('single', 0.09584152535240265),
('maximo manage', 0.09600993459669702),
('applications', 0.098110217945894),
('cmms', 0.0991284264831352),
('maximo mobile', 0.10147884696539622),
('operations', 0.10275673612171073),
('maximo application suite', 0.10860457497355133)]

aspentech_brief.txt: The Wide-Ranging Imp...

KeyBert
N-gram 1
['degrading', 'dangerous', 'toxins', 'damaging', 'hurts']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['competitive', 'powerful', 'shutdowns', 'robbing', 'toxins']

N-gram 1 | mmr True | diversity 0.7
['degrading', 'california', 'accounting', 'tomorrow', 'safest']

N-gram 1 | mmr True | diversity 0.2
['degrading', 'dangerous', 'toxins', 'damaging', 'hurts']

Yake
[ ('unplanned downtime', 0.007983226969004767),
('unplanned shutdowns', 0.010538494059896914),
('unplanned', 0.011342119859815765),
('downtime', 0.01742835270368349),
('reduce unplanned shutdowns', 0.018388374839244975),
('reduce unplanned downtime', 0.02003633232508374),
('technology', 0.021390981645117383),
('reduce unplanned', 0.022972668884322592),
('safety', 0.026816852757341234),
('shutdowns', 0.027532689357221824),
('operations', 0.029102917228258057),
('unplanned shutdown', 0.0368847292096392),
('maintenance', 0.037754074459235676),
('reduce', 0.04587756910104556),
('shutdown', 0.0458878155953697),
('predictive analytics', 0.05089898702999869),
('unplanned shutdowns cost', 0.05166283039628185),
('business', 0.05171852210565718),
('companies', 0.05193281587506164),
('operation', 0.05335534825180644)]

aspentech_blog.txt: From food and bevera...

KeyBert
N-gram 1
['businesses', 'aspentech', 'pharmaceuticals', 'everyone', 'petrochemical']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['owner', 'excited', 'oil', 'pharmaceuticals', 'aspentech']

N-gram 1 | mmr True | diversity 0.7
['businesses', 'eager', 'aspentech', 'oil', 'twins']

N-gram 1 | mmr True | diversity 0.2
['businesses', 'aspentech', 'pharmaceuticals', 'oil', 'petrochemical']

Yake
[ ('asset performance management', 0.001774287093354811),
('performance management', 0.004664558428596046),
('adopting asset performance', 0.006574406683268778),
('apm', 0.023224587127132566),
('actively adopting asset', 0.0241279733584699),
('asset performance', 0.025645678465285548),
('apm technology', 0.029995299363958845),
('technology', 0.04937113116851416),
('gas production', 0.055571440652063854),
('data', 0.05568651853043156),
('pharmaceuticals to oil', 0.057392310280758474),
('oil and gas', 0.057392310280758474),
('actively adopting', 0.057392310280758474),
('’re', 0.05870376376363502),
('assets', 0.06046162516715442),
('myth', 0.06051908857502686),
('understand', 0.06485308801630693),
('management', 0.06736024901319339),
('performance', 0.06892642785642587),
('asset', 0.0725539502005853)]

@MaartenGr
Owner

I cannot judge whether the keywords are accurate, as I do not know which specific texts you used. Having said that, there are several ways to further improve the results. Many of the highlighted keywords are combined words or keyphrases, so it stands to reason that increasing the n-gram range of KeyBERT (the keyphrase_ngram_range parameter) will result in better performance. The size of the documents might also play a role. If the documents are quite long, it becomes more difficult to extract keywords/keyphrases, as a long document will cover many topics/subjects.

Could you also share the documents as it is difficult to see where the differences are coming from without them?
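To illustrate why widening the n-gram range changes the results, here is a minimal sketch of how the candidate pool grows when phrases of up to three words are allowed. It is a simplified stand-in written for this thread, not KeyBERT's own candidate generation: the function name `ngram_candidates` and the regex tokenizer are assumptions made for illustration, and no stop-word removal is applied.

```python
import re


def ngram_candidates(text, ngram_range=(1, 1)):
    """Return the distinct word n-grams of `text`, in order of first appearance."""
    # Naive tokenizer (assumption): lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    low, high = ngram_range
    seen = {}  # dict keeps insertion order and drops duplicates
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            seen.setdefault(" ".join(tokens[i:i + n]), None)
    return list(seen)


text = "Predictive maintenance improves asset performance"
unigrams = ngram_candidates(text, (1, 1))
up_to_trigrams = ngram_candidates(text, (1, 3))

# Only the wider range can ever propose a phrase such as "predictive maintenance".
print(len(unigrams), len(up_to_trigrams))  # prints: 5 12
```

With a range of (1, 1) a phrase like "predictive maintenance" can never be returned, whatever the embedding model thinks of it, because it is never proposed as a candidate.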

@db1981
Author

db1981 commented Feb 18, 2021

Hi Maarten,
Indeed, the original texts are quite long in some cases. I forgot to mention that I ran KeyBERT with the n-gram range set to 1 to give it a chance to perform better. Below are the results with the n-gram range set to 1-3. I have also attached the various texts.

Thanks

abb_brief.txt: Six steps to predict...

KeyBert
N-gram 1-3
[ 'management is key',
'most important drivers',
'solutions are vital',
'significant asset optimization',
'vital tools']

N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
[ 'parameter implemented successfully',
'costly long service',
'business technology changing',
'decades experience helping',
'energy sector important']

N-gram 1-3 | mmr True | diversity 0.7
[ 'energy sector important',
'company marry data',
'requiring extra shutdowns',
'decades experience helping',
'nature machine learning']

N-gram 1-3 | mmr True | diversity 0.2
[ 'energy sector important',
'decades experience helping',
'optimization increased operational',
'monitoring solutions vital',
'business technology changing']

Yake
[ ('maintenance', 0.009284128714982649),
('predictive maintenance', 0.014353824666396151),
('asset', 0.017838983298883827),
('equipment', 0.01827858388518511),
('assets', 0.01921121278341335),
('data', 0.02202428913616752),
('performance', 0.02562450472920611),
('system', 0.030684635541426298),
('predictive', 0.03602109202391525),
('predictive maintenance strategy', 0.04021698272687705),
('preventative maintenance', 0.04097700545772903),
('key', 0.045187052701102695),
('asset performance', 0.04537221430023765),
('plant', 0.045993250652619805),
('step', 0.04701900159897897),
('asset health', 0.05129570336498491),
('maintenance strategy', 0.05190103991418995),
('asset performance management', 0.053141427401525),
('asset management system', 0.054003690529928906),
('systems', 0.056255165159281556)]

honeywell_brief.txt: Honeywell Brings Ene...

KeyBert
N-gram 1-3
[ 'north sea oil',
'norway as honeywell',
'oslo engineers to',
'lundin oslo engineers',
'co2 emission reduction']

N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
[ 'co2 emission reduction',
'norway honeywell',
'engineer lundin norway',
'energy accounting remote',
'oil gas field']

N-gram 1-3 | mmr True | diversity 0.7
[ 'serves oil gas',
'libraries lundin norway',
'calculation presentation daily',
'best efficiency point',
'roads forever losses']

N-gram 1-3 | mmr True | diversity 0.2
[ 'serves oil gas',
'norway honeywell asset',
'energy accounting remote',
'honeywell forge expert',
'oslo engineers facilitate']

Yake
[ ('edvard grieg', 0.0047243731441952855),
('honeywell brings energy', 0.005048731021543624),
('lundin norway edvard', 0.00774357023651215),
('norway edvard grieg', 0.007859681432647146),
('lundin', 0.01097489749993232),
('honeywell', 0.011659075482465663),
('edvard grieg platform', 0.011747470639833146),
('honeywell forge', 0.012242825657779946),
('lundin norway creates', 0.012446534894446847),
('honeywell brings', 0.012646039494275783),
('asset performance management', 0.013202226795784838),
('brings energy accounting', 0.01421896527633952),
('enterprise performance management', 0.014355323112609336),
('lundin norway', 0.014378904821544412),
('performance management', 0.01779551307545978),
('honeywell forge asset', 0.017857115752163682),
('asset performance', 0.02052810653424872),
('north sea', 0.02116890341510034),
('edvard grieg serves', 0.022453059681042196),
('performance management software', 0.022799786193990212)]

ibm_brief.txt: Essential intelligen...

KeyBert
N-gram 1-3
[ 'uptime improve productivity',
'workflows to accelerate',
'accelerate your industry',
'improve replacement planning',
'improve operational performance']

N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
[ 'assets lifecycle faster',
'industry transformation unify',
'working helps business',
'improve replacement planning',
'changing conditions helps']

N-gram 1-3 | mmr True | diversity 0.7
[ 'workflows accelerate industry',
'single data lake',
'shelf ios devices',
'20 extend lives',
'predictive maintenance reduce']

N-gram 1-3 | mmr True | diversity 0.2
[ 'workflows accelerate industry',
'changing conditions helps',
'accelerate industry transformation',
'uptime improve productivity',
'platform improving productivity']

Yake
[ ('asset', 0.019425671610029897),
('maximo', 0.03385207707983861),
('maintenance', 0.043681051375471604),
('data', 0.04682441705459914),
('asset management', 0.0489307794359093),
('eam', 0.06313545614404494),
('management', 0.06687517191726874),
('operational', 0.06982561607664019),
('assets', 0.06993241779610762),
('costs', 0.07471213680215111),
('reduce', 0.07615429287108003),
('essential intelligence', 0.0768640289294993),
('reliable asset management', 0.07907749923254684),
('single', 0.09584152535240265),
('maximo manage', 0.09600993459669702),
('applications', 0.098110217945894),
('cmms', 0.0991284264831352),
('maximo mobile', 0.10147884696539622),
('operations', 0.10275673612171073),
('maximo application suite', 0.10860457497355133)]

aspentech_brief.txt: The Wide-Ranging Imp...

KeyBert
N-gram 1-3
[ 'most dangerous conditions',
'downtime costs refinery',
'also disproportionately damaging',
'disproportionately damaging',
'disproportionately damaging just']

N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
[ 'unplanned downtime hurts',
'years worth toxins',
'financial forced shutdowns',
'tremendous waste fuel',
'shutdown benefits significant']

N-gram 1-3 | mmr True | diversity 0.7
[ 'shutdowns disproportionately damaging',
'predictive analytics companies',
'conveyor coming month',
'plants safer fact',
'300 000 barrels']

N-gram 1-3 | mmr True | diversity 0.2
[ 'shutdowns disproportionately damaging',
'shutdowns represent dangerous',
'hurts productivity profitability',
'shutdown benefits significant',
'significant disruptions business']

Yake
[ ('unplanned downtime', 0.007983226969004767),
('unplanned shutdowns', 0.010538494059896914),
('unplanned', 0.011342119859815765),
('downtime', 0.01742835270368349),
('reduce unplanned shutdowns', 0.018388374839244975),
('reduce unplanned downtime', 0.02003633232508374),
('technology', 0.021390981645117383),
('reduce unplanned', 0.022972668884322592),
('safety', 0.026816852757341234),
('shutdowns', 0.027532689357221824),
('operations', 0.029102917228258057),
('unplanned shutdown', 0.0368847292096392),
('maintenance', 0.037754074459235676),
('reduce', 0.04587756910104556),
('shutdown', 0.0458878155953697),
('predictive analytics', 0.05089898702999869),
('unplanned shutdowns cost', 0.05166283039628185),
('business', 0.05171852210565718),
('companies', 0.05193281587506164),
('operation', 0.05335534825180644)]

aspentech_blog.txt: From food and bevera...

KeyBert
N-gram 1-3
[ 'more businesses gain',
'businesses gain better',
'across all industries',
'as more businesses',
'my business analytics']

N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
[ 'customer demands important',
'based data businesses',
'excited analytics bring',
'processing pharmaceuticals oil',
'best business analytics']

N-gram 1-3 | mmr True | diversity 0.7
[ 'industry strong synergies',
'wait long disappear',
'game changer world',
'beverage processing pharmaceuticals',
'oil gas production']

N-gram 1-3 | mmr True | diversity 0.2
[ 'industry strong synergies',
'best business analytics',
'processing pharmaceuticals oil',
'businesses gain better',
'industries actively adopting']

Yake
[ ('asset performance management', 0.001774287093354811),
('performance management', 0.004664558428596046),
('adopting asset performance', 0.006574406683268778),
('apm', 0.023224587127132566),
('actively adopting asset', 0.0241279733584699),
('asset performance', 0.025645678465285548),
('apm technology', 0.029995299363958845),
('technology', 0.04937113116851416),
('gas production', 0.055571440652063854),
('data', 0.05568651853043156),
('pharmaceuticals to oil', 0.057392310280758474),
('oil and gas', 0.057392310280758474),
('actively adopting', 0.057392310280758474),
('’re', 0.05870376376363502),
('assets', 0.06046162516715442),
('myth', 0.06051908857502686),
('understand', 0.06485308801630693),
('management', 0.06736024901319339),
('performance', 0.06892642785642587),
('asset', 0.0725539502005853)]
abb_brief.txt
honeywell_brief.txt
ibm_brief.txt
aspentech_blog.txt
aspentech_brief.txt

@MaartenGr
Owner

To me, it seems to be of much better quality than the keywords before. Keyphrases can typically capture the document representation better than keywords can.

@zolekode

zolekode commented Mar 3, 2021

First of all, really awesome library. Thanks for sharing this @MaartenGr .

I think one of the main issues here is that KeyBERT produces a lot of "incomplete" keywords/keyphrases. By incomplete I mean keyphrases that don't read as a coherent unit. For example, in "businesses gain better" the "better" is just hanging there. However, YAKE is purely based on syntax, so @db1981, when you say YAKE produces the "right" keywords, that is probably very subjective.

I still think that YAKE performed better for the texts @db1981 provided above, although theoretically KeyBERT would make more sense.

@MaartenGr I have a final question.
I read the code and saw that you have to generate a long list of words or phrases (in this case, candidates), then you compute embeddings to find the most similar words. Would it help if you let YAKE generate the candidate keywords, and use SentBERT to filter out the best ones?

Another benefit for this could be that you will end up with a much smaller and more accurate search space, thereby minimising the memory problem. I think it might be worth a try.

@MaartenGr
Owner

I agree with that interpretation of what is currently happening. The coherence of a keyphrase is not checked; its ranking depends entirely on its embedding. As long as the right words are in there, the keyphrase will have a high similarity with the document and will therefore be treated as a good fit.

In practice, this will most likely mean that both methods, YAKE and KeyBERT, will differ across use-cases and you will find that one might be much better than the other depending on the documents that were used.

@zolekode It would definitely be a nice addition to this solution. By using YAKE to generate a set of candidates, we can use embeddings, together with MMR, to further optimize the candidates.

There are a few disadvantages I foresee, though. First, the compute time will increase significantly, as we are essentially applying two solutions at the same time. Second, by using YAKE we assume that it generates all the best candidates, which in practice is not true, as KeyBERT generates interesting candidates that were missed by YAKE (and vice versa, of course).

Having said that, I do think it would be great if there was a way to combine some of the statistics that YAKE uses to generate additional candidates without the issues mentioned above.
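For readers unfamiliar with the MMR (Maximal Marginal Relevance) re-ranking mentioned above, a minimal sketch follows. It uses the usual formulation (relevance to the document traded off against similarity to already selected candidates) and is not copied from KeyBERT's code; the function names and the tiny 2-dimensional vectors are assumptions made purely for illustration, standing in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5):
    """Return the indices of the candidates chosen by Maximal Marginal Relevance."""
    if not cand_vecs:
        return []
    relevance = [cosine(v, doc_vec) for v in cand_vecs]
    # Start with the candidate most similar to the document.
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    while len(selected) < min(top_n, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]

        def score(i):
            # Penalise candidates that resemble something already selected.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected


doc_vec = [1.0, 1.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points in another direction.
cand_vecs = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]

high_diversity = mmr(doc_vec, cand_vecs, top_n=2, diversity=0.7)
no_diversity = mmr(doc_vec, cand_vecs, top_n=2, diversity=0.0)
```

With a high diversity value the near-duplicate is skipped in favour of the less similar candidate, which is consistent with the more scattered keyphrases in the "diversity 0.7" runs posted earlier in this thread.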

@ssubraveti

Hi!
First of all, this is a really amazing and useful library. I guess one way to incorporate @zolekode's suggestion would be to just have a general function that takes in document(s) and a list of n-grams, and returns a ranked list of these n-grams based on cosine similarity of the document embedding(s) to the n-grams in the list. This way, KeyBert can also be used to rank a limited set of keyphrases, rather than just being able to rank all possible n-grams in a document. Let me know if that makes sense!
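The general function described above could be sketched roughly as follows. This is only an illustration of the idea, not an existing KeyBERT function: `rank_candidates` and `toy_embed` are hypothetical names, and `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model so that the example runs without any dependencies.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(doc, candidates, embed=toy_embed):
    """Rank a given list of n-grams by cosine similarity to the document."""
    doc_vec = embed(doc)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


doc = ("Unplanned downtime hurts productivity. "
       "Predictive analytics can reduce unplanned downtime.")
ranked = rank_candidates(doc, ["game changer", "predictive analytics", "unplanned downtime"])
for phrase, score in ranked:
    print(f"{score:.3f}  {phrase}")
```

Because only the supplied n-grams are embedded, the cost scales with the length of the candidate list rather than with the number of possible n-grams in the document. Swapping `toy_embed` for a real model would only require passing a different `embed` callable together with a matching similarity function.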

@zolekode

@MaartenGr @ssubraveti thanks for putting some thought into this. I was thinking of something similar to @ssubraveti's suggestion. Computing embeddings for all n-grams is very expensive in production, so YAKE could also be used simply to reduce the search space, at least by a bit. But @MaartenGr, you are right: this would mean we are assuming YAKE does a good preselection, which might not always be true. I guess a simple solution is to enable something like yake_preselection=True/False.

@MaartenGr
Owner

@zolekode @db1981 Over the last few weeks I have been quite busy with the new release of BERTopic. Fortunately, that one is now released and I will be spending some time on the next release of KeyBERT. You can find the PR here where I will keep track of any updates, including the roadmap. If you have any suggestions, please let me know either here or there!

@zolekode

@MaartenGr thanks for the heads up

@MaartenGr
Owner

@ssubraveti @zolekode @db1981
KeyBERT v0.3 was just released. It adds the option to use many different backends, as well as the option to use candidates generated by other extractors. Follow the documentation here for all changes, or follow along with the Google Colab example:

Open In Colab
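As a rough sketch of how candidates from another extractor might be passed to KeyBERT, consider the helper below. The helper name `keywords_from_external_candidates` is hypothetical; it assumes the first object exposes an `extract_keywords(doc, candidates=..., top_n=...)` method, as KeyBERT does from v0.3 on, and that the second exposes `extract_keywords(doc)` returning (phrase, score) pairs, as YAKE does. The official documentation and the Colab example remain the authoritative reference.

```python
def keywords_from_external_candidates(doc, kw_model, candidate_extractor, top_n=5):
    """Let another extractor propose candidates, then let the embedding model rank them."""
    # Assumption: the extractor returns (phrase, score) pairs, as YAKE does.
    pairs = candidate_extractor.extract_keywords(doc)
    candidates = [phrase for phrase, _score in pairs]
    # Assumption: kw_model accepts a `candidates` argument (KeyBERT >= 0.3).
    return kw_model.extract_keywords(doc, candidates=candidates, top_n=top_n)


# Intended usage, assuming the `keybert` and `yake` packages are installed:
#
#   import yake
#   from keybert import KeyBERT
#
#   keywords = keywords_from_external_candidates(
#       doc, KeyBERT(), yake.KeywordExtractor(top=50)
#   )
```

Passing both objects in as arguments keeps the helper independent of either library, so the preselection step can be switched off or replaced by a different extractor without touching the ranking step.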
