Keyword extraction results vs YAKE #25
Comments
I cannot judge whether the keywords are accurate, as I do not know which specific texts you used. Having said that, there are several ways to further improve the results. Many of the highlighted keywords are combined words or keyphrases, so it stands to reason that increasing the n_gram_range of KeyBERT will similarly result in better performance. The difference might also be due to the size of the documents: if the documents are quite long, it becomes more difficult to extract keywords/keyphrases because a document will cover many topics/subjects. Could you also share the documents? It is difficult to see where the differences are coming from without them.
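To illustrate why a wider n-gram range helps, here is a minimal sketch of candidate generation. It uses a toy tokenizer and a hypothetical helper, `ngram_candidates`; it is not KeyBERT's actual candidate code, which also handles things like stop words. In recent KeyBERT releases the corresponding setting is, as far as I know, the `keyphrase_ngram_range` argument of `extract_keywords`.

```python
import re


def ngram_candidates(text, ngram_range=(1, 1)):
    """Return the set of word n-grams in `text` for every n in the inclusive range."""
    # Toy tokenizer (assumption): lowercase alphanumeric runs only.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    low, high = ngram_range
    candidates = set()
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[start:start + n]))
    return candidates


text = "A predictive maintenance strategy improves asset performance."
unigrams = ngram_candidates(text, (1, 1))
phrases = ngram_candidates(text, (1, 3))

# With unigrams only, a phrase can never be returned as a keyword.
print("predictive maintenance" in unigrams)
print("predictive maintenance" in phrases)
```

With a range of 1 only single words can be ranked, so a phrase such as "predictive maintenance" is not even a candidate, no matter how well it would score.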
Hi Maarten,
Thanks
abb_brief.txt: Six steps to predict...
KeyBert
N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
N-gram 1-3 | mmr True | diversity 0.7
N-gram 1-3 | mmr True | diversity 0.2
Yake
honeywell_brief.txt: Honeywell Brings Ene...
KeyBert
N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
N-gram 1-3 | mmr True | diversity 0.7
N-gram 1-3 | mmr True | diversity 0.2
Yake
ibm_brief.txt: Essential intelligen...
KeyBert
N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
N-gram 1-3 | mmr True | diversity 0.7
N-gram 1-3 | mmr True | diversity 0.2
Yake
aspentech_brief.txt: The Wide-Ranging Imp...
KeyBert
N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
N-gram 1-3 | mmr True | diversity 0.7
N-gram 1-3 | mmr True | diversity 0.2
Yake
aspentech_blog.txt: From food and bevera...
KeyBert
N-gram 1-3 | maxsum True | nr_candidates 20 | top_n 5
N-gram 1-3 | mmr True | diversity 0.7
N-gram 1-3 | mmr True | diversity 0.2
Yake
To me, these seem to be of much better quality than the keywords before. Keyphrases can typically capture the document representation better than keywords can.
First of all, really awesome library. Thanks for sharing this, @MaartenGr. I think one of the main issues here is that KeyBERT produces a lot of "incomplete" keywords/keyphrases. By incomplete I mean keywords that don't sound completely coherent. For example, I still think that YAKE performed better for the text @db1981 provided above, although theoretically KeyBERT would make more sense. @MaartenGr I have a final question. Another benefit of this could be that you will end up with a much smaller and more accurate search space, thereby minimising the memory problem. I think it might be worth a try.
I agree with that interpretation of what is currently happening. The coherence of the keyphrases is not checked, since the result depends heavily on the embedding of the keyphrase. As long as the right words are in there, the keyphrase will have a high similarity with the document and will therefore be treated as a good fit. In practice, this most likely means that YAKE and KeyBERT will perform differently across use cases, and one might be much better than the other depending on the documents that were used.

@zolekode It would definitely be a nice addition to this solution. By using YAKE to generate a set of candidates, we can use embeddings, together with MMR, to further optimize the candidates. I do foresee a few disadvantages with this approach, though. First, the compute time will increase significantly, as we are essentially applying two solutions at the same time. Second, by using YAKE we assume that it generates all the best candidates, which in practice is not true, as KeyBERT generates interesting candidates that were missed by YAKE (and vice versa, of course). Having said that, I do think it would be great if there was a way to combine some of the statistics that YAKE uses to generate additional candidates without the issues mentioned above.
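To make the MMR step concrete, here is a small sketch of textbook Maximal Marginal Relevance over toy vectors. The 2-D "embeddings" are made up for illustration, and the function is not necessarily identical to KeyBERT's own implementation. Each round picks the candidate that best trades off similarity to the document against similarity to the phrases already selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, candidates, top_n=2, diversity=0.5):
    """Select `top_n` phrases from `candidates` (phrase -> vector) using MMR."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best_phrase, best_score = None, -math.inf
        for phrase, vec in remaining.items():
            relevance = cosine(vec, doc_vec)
            # Redundancy: highest similarity to anything already picked.
            redundancy = max(
                (cosine(vec, candidates[s]) for s in selected), default=0.0
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best_phrase, best_score = phrase, score
        selected.append(best_phrase)
        del remaining[best_phrase]
    return selected


# Hypothetical embeddings: two near-duplicate phrases and one distinct phrase.
doc_vec = (1.0, 1.0)
candidates = {
    "predictive maintenance": (1.0, 0.9),
    "maintenance strategy": (1.0, 0.8),
    "asset health": (0.2, 1.0),
}

high_diversity = mmr(doc_vec, candidates, top_n=2, diversity=0.7)
low_diversity = mmr(doc_vec, candidates, top_n=2, diversity=0.2)
print(high_diversity)
print(low_diversity)
```

With these toy vectors, a high diversity value skips the near-duplicate phrase in favour of the distinct one, while a low value keeps both similar phrases. Note that nothing here checks whether a phrase is coherent; only the vectors matter.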
Hi!
@MaartenGr @ssubraveti thanks for putting some thought into this. I was thinking of something similar to @ssubraveti's suggestion. Computing embeddings for all n-grams is very expensive in production, so YAKE could also be used to, well, just reduce the search space. At least by a bit. But @MaartenGr you are right, this will mean we are assuming YAKE does a good preselection, which might not always be true. I guess a simple solution is just to enable something like yake_preselection=True/False.
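A rough sketch of what such a switch could look like is below. The function `candidates_to_embed` and the `keep` argument are hypothetical, and `yake_preselection` is only the flag name proposed above, not an existing KeyBERT option. The scores are copied from the abb_brief.txt YAKE output in this thread, where a lower YAKE score means a more relevant phrase.

```python
def candidates_to_embed(all_ngrams, yake_scored, yake_preselection=False, keep=3):
    """Return the phrases that would be passed on to the embedding model.

    all_ngrams: every n-gram candidate found in the document.
    yake_scored: (phrase, score) pairs from YAKE; lower score = more relevant.
    """
    if not yake_preselection:
        # Default behaviour: embed every n-gram candidate.
        return list(all_ngrams)
    ranked = sorted(yake_scored, key=lambda pair: pair[1])
    return [phrase for phrase, _ in ranked[:keep]]


# A few (phrase, score) pairs taken from the abb_brief.txt YAKE results.
yake_scored = [
    ("asset", 0.017838983298883827),
    ("predictive maintenance", 0.014353824666396151),
    ("systems", 0.056255165159281556),
    ("maintenance", 0.009284128714982649),
    ("asset health", 0.05129570336498491),
    ("equipment", 0.01827858388518511),
]
all_ngrams = [phrase for phrase, _ in yake_scored]

everything = candidates_to_embed(all_ngrams, yake_scored)
preselected = candidates_to_embed(all_ngrams, yake_scored, yake_preselection=True)
print(len(everything), len(preselected))
print(preselected)
```

The saving comes from embedding only `keep` phrases instead of every n-gram. The cost is the one already mentioned: anything YAKE ranks poorly never reaches the embedding step.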
@zolekode @db1981 In the last few weeks I was quite busy with the new release of BERTopic. Fortunately, that one is now released and I will be spending some time on the next release of KeyBERT. You can find the PR here, where I will keep track of any updates, including the roadmap. If you have any suggestions, please let me know either here or there!
@MaartenGr thanks for the heads up |
@ssubraveti @zolekode @db1981 |
Hi Maarten,
I was super excited when I found out about your project because I wasn't happy with the results of the "static" algorithms (TF-IDF, RAKE, etc) and I thought that adding the Transformers twist could have been a game changer.
I just ran KeyBERT on a bunch of text and unfortunately the results are far from what I expected. I wanted to understand if I'm missing something in the configuration. I ran a comparison against YAKE, which I believe delivers a good selection. I highlighted what I believe are the right keywords among those extracted.
abb_brief.txt: Six steps to predict...
KeyBert
N-gram 1
['powerful', 'crucial', 'heuristic', 'heuristics', 'holistic']
N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['engineers', 'training', 'costly', 'decades', 'holistic']
N-gram 1 | mmr True | diversity 0.7
['powerful', 'environmental', 'oil', 'decades', 'conducting']
N-gram 1 | mmr True | diversity 0.2
['powerful', 'decades', 'engineers', 'holistic', 'oil']
Yake
[ ('maintenance', 0.009284128714982649),
('predictive maintenance', 0.014353824666396151),
('asset', 0.017838983298883827),
('equipment', 0.01827858388518511),
('assets', 0.01921121278341335),
('data', 0.02202428913616752),
('performance', 0.02562450472920611),
('system', 0.030684635541426298),
('predictive', 0.03602109202391525),
('predictive maintenance strategy', 0.04021698272687705),
('preventative maintenance', 0.04097700545772903),
('key', 0.045187052701102695),
('asset performance', 0.04537221430023765),
('plant', 0.045993250652619805),
('step', 0.04701900159897897),
('asset health', 0.05129570336498491),
('maintenance strategy', 0.05190103991418995),
('asset performance management', 0.053141427401525),
('asset management system', 0.054003690529928906),
('systems', 0.056255165159281556)]
honeywell_brief.txt: Honeywell Brings Ene...
KeyBert
N-gram 1
['oil', 'norway', 'norwegian', 'oslo', 'offshore']
N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['train', '70mw', 'environmental', 'offshore', 'norway']
N-gram 1 | mmr True | diversity 0.7
['oil', 'norway', 'accounting', 'daily', 'environmental']
N-gram 1 | mmr True | diversity 0.2
['oil', 'norway', 'offshore', 'compressors', 'environmental']
Yake
[ ('edvard grieg', 0.0047243731441952855),
('honeywell brings energy', 0.005048731021543624),
('lundin norway edvard', 0.00774357023651215),
('norway edvard grieg', 0.007859681432647146),
('lundin', 0.01097489749993232),
('honeywell', 0.011659075482465663),
('edvard grieg platform', 0.011747470639833146),
('honeywell forge', 0.012242825657779946),
('lundin norway creates', 0.012446534894446847),
('honeywell brings', 0.012646039494275783),
('asset performance management', 0.013202226795784838),
('brings energy accounting', 0.01421896527633952),
('enterprise performance management', 0.014355323112609336),
('lundin norway', 0.014378904821544412),
('performance management', 0.01779551307545978),
('honeywell forge asset', 0.017857115752163682),
('asset performance', 0.02052810653424872),
('north sea', 0.02116890341510034),
('edvard grieg serves', 0.022453059681042196),
('performance management software', 0.022799786193990212)]
ibm_brief.txt: Essential intelligen...
KeyBert
N-gram 1
['optimizing', 'improving', 'workflow', 'adaptability', 'workflows']
N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['expanded', 'analytics', 'lifecycle', 'efficiency', 'workflow']
N-gram 1 | mmr True | diversity 0.7
['optimizing', 'ibm', 'lake', 'global', 'lifecycle']
N-gram 1 | mmr True | diversity 0.2
['optimizing', 'workflow', 'improving', 'lifecycle', 'adaptability']
Yake
[ ('asset', 0.019425671610029897),
('maximo', 0.03385207707983861),
('maintenance', 0.043681051375471604),
('data', 0.04682441705459914),
('asset management', 0.0489307794359093),
('eam', 0.06313545614404494),
('management', 0.06687517191726874),
('operational', 0.06982561607664019),
('assets', 0.06993241779610762),
('costs', 0.07471213680215111),
('reduce', 0.07615429287108003),
('essential intelligence', 0.0768640289294993),
('reliable asset management', 0.07907749923254684),
('single', 0.09584152535240265),
('maximo manage', 0.09600993459669702),
('applications', 0.098110217945894),
('cmms', 0.0991284264831352),
('maximo mobile', 0.10147884696539622),
('operations', 0.10275673612171073),
('maximo application suite', 0.10860457497355133)]
aspentech_brief.txt: The Wide-Ranging Imp...
KeyBert
N-gram 1
['degrading', 'dangerous', 'toxins', 'damaging', 'hurts']
N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['competitive', 'powerful', 'shutdowns', 'robbing', 'toxins']
N-gram 1 | mmr True | diversity 0.7
['degrading', 'california', 'accounting', 'tomorrow', 'safest']
N-gram 1 | mmr True | diversity 0.2
['degrading', 'dangerous', 'toxins', 'damaging', 'hurts']
Yake
[ ('unplanned downtime', 0.007983226969004767),
('unplanned shutdowns', 0.010538494059896914),
('unplanned', 0.011342119859815765),
('downtime', 0.01742835270368349),
('reduce unplanned shutdowns', 0.018388374839244975),
('reduce unplanned downtime', 0.02003633232508374),
('technology', 0.021390981645117383),
('reduce unplanned', 0.022972668884322592),
('safety', 0.026816852757341234),
('shutdowns', 0.027532689357221824),
('operations', 0.029102917228258057),
('unplanned shutdown', 0.0368847292096392),
('maintenance', 0.037754074459235676),
('reduce', 0.04587756910104556),
('shutdown', 0.0458878155953697),
('predictive analytics', 0.05089898702999869),
('unplanned shutdowns cost', 0.05166283039628185),
('business', 0.05171852210565718),
('companies', 0.05193281587506164),
('operation', 0.05335534825180644)]
aspentech_blog.txt: From food and bevera...
KeyBert
N-gram 1
['businesses', 'aspentech', 'pharmaceuticals', 'everyone', 'petrochemical']
N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['owner', 'excited', 'oil', 'pharmaceuticals', 'aspentech']
N-gram 1 | mmr True | diversity 0.7
['businesses', 'eager', 'aspentech', 'oil', 'twins']
N-gram 1 | mmr True | diversity 0.2
['businesses', 'aspentech', 'pharmaceuticals', 'oil', 'petrochemical']
Yake
[ ('asset performance management', 0.001774287093354811),
('performance management', 0.004664558428596046),
('adopting asset performance', 0.006574406683268778),
('apm', 0.023224587127132566),
('actively adopting asset', 0.0241279733584699),
('asset performance', 0.025645678465285548),
('apm technology', 0.029995299363958845),
('technology', 0.04937113116851416),
('gas production', 0.055571440652063854),
('data', 0.05568651853043156),
('pharmaceuticals to oil', 0.057392310280758474),
('oil and gas', 0.057392310280758474),
('actively adopting', 0.057392310280758474),
('’re', 0.05870376376363502),
('assets', 0.06046162516715442),
('myth', 0.06051908857502686),
('understand', 0.06485308801630693),
('management', 0.06736024901319339),
('performance', 0.06892642785642587),
('asset', 0.0725539502005853)]