2026
Large Scale Text Classification Using Map Reduce and Naive Bayes Algorithm for Domain Specified Ontology Building
(Joan, Eko)
Our preprocessing in this research consists of four processes, shown in detail in Figure 2. The first process is content extraction, which extracts website content by removing the unimportant parts of a web page using several handmade rules. The second process is tokenization, which breaks a document up into tokens such as words, numbers, or punctuation; we developed several regular expressions to support it. The third process is stopword removal: common words that usually appear in documents are removed using a dictionary list from [15]. The fourth and last process in our preprocessing phase is stemming, which maps different morphological variants of a word to its base word; we use the Indonesian Language Stemmer from [15]. We implement this preprocessing algorithm using MapReduce in Algorithms 1 and 2.
Our classification process is divided into two phases. The first phase is learning from annotated documents to build the classification model.
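A MapReduce formulation of this preprocessing can be sketched in plain Python as below. This is a minimal illustration, not the paper's implementation: the stopword list is a made-up sample (the paper uses the dictionary from [15]), and the stemming step is omitted because it requires the Indonesian stemmer.

```python
import re

# Sample stopword list; the paper uses an Indonesian dictionary from [15].
STOPWORDS = {"yang", "dan", "di", "ke", "dari"}

def map_preprocess(doc_id, text):
    """Map step: tokenize with a regular expression, drop stopwords,
    and emit (token, 1) pairs for the learning phase to count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(tok, 1) for tok in tokens if tok not in STOPWORDS]

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct token."""
    counts = {}
    for tok, n in pairs:
        counts[tok] = counts.get(tok, 0) + n
    return counts

pairs = map_preprocess("doc1", "Berita dan informasi dari situs berita")
print(reduce_counts(pairs))  # → {'berita': 2, 'informasi': 1, 'situs': 1}
```

In a real MapReduce job the reduce step runs per token after shuffling; here it is collapsed into one dictionary for brevity.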
Text Mining in Healthcare for Disease Classification using Machine Learning Algorithm (Ghulam, Prof)
A. Collection of Data
This study uses medical record data from the Dr. Soetomo general hospital in Surabaya. The data period used in this study was from February 1, 2017, to February 1, 2019, with a total of 2,271 medical records covering 19 disease categories.
B. Methodology
The methodology proposed in this study consists of several stages, namely data collection, data preprocessing, the classification process, and evaluation of results. Figure 1 depicts the proposed methodology.
This study uses Electronic Medical Record (EMR) data from the Dr. Soetomo general hospital in Surabaya. The data are retrieved from the database as unstructured text. A total of 2,271 text records of patients' subjective symptoms, covering 19 disease categories, were analyzed. The data period used in this study was from February 1, 2017, to February 1, 2019.
Transfer Learning Approaches for Indonesian Biomedical Entity Recognition (Diana, Safitri)
Two datasets are used in the experiments, named MedMentions and IDMedMentions (Fig. 1). MedMentions is a corpus gathered from PubMed that consists of the abstract sections of biomedical papers annotated by experts. To standardize the annotation format, the UMLS tag set is used. The whole dataset consists of 4,392 documents and 579,839 annotated tokens (132 tokens/doc), and is reported to achieve an annotation precision of 97.3%. The raw dataset consists of the original documents, abstracts, and biomedical mentions along with their boundaries. We wrote Python scripts to extract mentions from each document and put them into BIO (Beginning, Inside, Outside) format. Biomedical mentions are tagged with B- and I- tags, while irrelevant tokens are tagged with O. As an example, "Knee injury may happened" is tagged with "B-DISO I-DISO O O".
Table 1. Tag descriptions and sample mentions:
DISO: all types of disorders, including symptoms, abnormalities, and diseases.
PROC: medical and diagnostic procedures, including preventive, research, and educational activity (e.g., sleep well, avoid pollution, circumcision, brain surgery).
CHEM: chemical and drug substances, including biological and dental substances, enzymes, and hormones (e.g., blood, bilirubin, penicillin, urine).
ANAT: anatomical structures, including body anatomy, body parts, body systems, and cells (e.g., eye, skin, reproductive system, digestive system, urinary glands).
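The span-to-BIO conversion described above can be sketched as follows; the `(start, end, group)` span format and the function name are illustrative assumptions, not the actual scripts.

```python
def to_bio(tokens, mentions):
    """Convert mention spans (start, end, group) over a token list into BIO
    tags: B-<group> on the first token, I-<group> inside, O everywhere else."""
    tags = ["O"] * len(tokens)
    for start, end, group in mentions:
        tags[start] = "B-" + group
        for i in range(start + 1, end):
            tags[i] = "I-" + group
    return tags

tokens = "Knee injury may happened".split()
print(to_bio(tokens, [(0, 2, "DISO")]))  # → ['B-DISO', 'I-DISO', 'O', 'O']
```

Each token and its tag can then be written out one token per line, tab-separated, matching the layout the datasets use.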
There are about 3,000,000 tags registered in the UMLS data bank, which makes model development not straightforward if the whole tag set is used. Therefore, Llanos et al. use UMLS semantic groups to categorize the tags into a simpler form: DISO (disorders), PROC (procedures), CHEM (chemicals), and ANAT (anatomy) [19]. We follow this approach since it reduces computation cost without omitting original information. Descriptions of each tag are presented in Table 1, and the description of each group is based on the original UMLS research [20].
IDMedMentions is a raw biomedical dataset gathered from a public health forum (alodokter.com) during 2016-2021. The Scrapy library was used as the scraping tool to retrieve and parse questions, answers, and relevant document metadata. After exploratory analysis we found that the question data contain much informal language, while the answer data do not. Therefore, we focused on extracting only answers written by general physicians. The total data gathered are 501,342 documents; however, we annotated only 2,069 of them as training and testing data. To annotate IDMedMentions we employed medical students, producing 43,973 annotated tokens (21 tokens/doc) following the NLPMEDTERM guideline [19]. The NLPMEDTERM guideline contains rules and inclusion and exclusion criteria for annotating biomedical data. This dataset is later divided into 2,016 training and 53 test documents. Each token and its corresponding tag in both datasets are separated by a tab delimiter.
Automatic Assessment of Answers to Mathematics Stories Question Based on Tree Matching and Random Forest
(Selvia, Yuhana)
This research develops a system that can solve math story problems automatically. It starts by pre-processing the story questions and then classifying them according to the operator used, with the best classification method. We investigated Random Forest and Support Vector Machine (SVM) to find the best performance. After that, the system builds a tree that represents each operand and operator in the problem and then solves the problem. At the same time, the system can also show the competence of elementary school students based on their answers when solving math story problems. Questions are captured as images and entered into the system using the OpenCV and Tesseract libraries.
Figure 5. Method
Then, as before, a tree is formed from the student's answer and compared with the tree generated by the previous classification. The comparison of the two trees proceeds from the left child and then to the parent. As a result, the students' answers are assessed stage by stage, so it is not only the final answer that is judged but also every step of the process. The test is carried out in three scenarios: the classification method test, the tree matching result test, and finally the student answer scoring test against the ground truth.
Several research questions were prepared: How can answers be generated automatically from math story problems? How does the Random Forest method perform, in terms of accuracy, in classifying the operations in story problems? How can a tree be formed to compare the system's results with student answers? And how can the performance of the proposed system be measured?
Furthermore, the research is limited to the following: a) story questions with at most two arithmetic operations; b) story questions with exactly one unknown or asked-for variable; c) questions that use basic arithmetic operations (+, -, ÷, ×); d) a dataset of text data and image data with a pixel size of 512 × 512; e) story questions with a level of difficulty appropriate to Elementary School Grades 1 to 3.
This section shows how math word questions are processed into a tree form, which is later used as an indicator of student competency achievement. The flow of the method can be seen in Fig. 5.
The process begins with a dataset of math word problems that go through two processes. The first branch is classification. Based on experiments with Random Forest and SVM, we found that Random Forest showed the best performance; therefore, we use the Random Forest method to determine the number of operations contained in each story. Then a tree is created from the way the story problem is solved. This tree is structured from conditional sentences that encode rules following the order of mathematical operations.
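As an illustration of such a tree, a solution expression can be parsed into a binary tree that respects the order of mathematical operations. The sketch below uses the shunting-yard algorithm as an assumed approach; the paper does not state its exact construction rules.

```python
# Operator precedence for the four basic operations ("x" and ":" stand in
# for × and ÷ so the expression can stay plain ASCII).
PRECEDENCE = {"+": 1, "-": 1, "x": 2, ":": 2}

def build_tree(expression):
    """Parse a space-separated arithmetic expression into a nested
    (operator, left, right) tuple using the shunting-yard algorithm."""
    output, ops = [], []
    for tok in expression.split():
        if tok in PRECEDENCE:
            # Reduce pending operators of equal or higher precedence first,
            # which also makes same-precedence operators left-associative.
            while ops and PRECEDENCE[ops[-1]] >= PRECEDENCE[tok]:
                right, left = output.pop(), output.pop()
                output.append((ops.pop(), left, right))
            ops.append(tok)
        else:
            output.append(tok)  # operand
    while ops:
        right, left = output.pop(), output.pop()
        output.append((ops.pop(), left, right))
    return output[0]

print(build_tree("2 + 3 x 4"))  # → ('+', '2', ('x', '3', '4'))
```

Multiplication binds tighter than addition, so it becomes the deeper subtree, matching the left-child-first evaluation order used for grading.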
The process in the second branch is to give the math word question to students; a scanner then captures the students' answers, which are converted into text using the Tesseract library. From each student's answer, a tree is formed and compared with the tree from the first branch. The result shows how the student answered the question with respect to the mathematical order of operations.
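The left-child-then-parent comparison can be sketched as a recursive traversal that counts how many reference steps the student's tree reproduces. The node structure and the matched/total scoring are illustrative assumptions, not the paper's exact procedure.

```python
class Node:
    """A node of an expression tree: an operator with children, or a leaf operand."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def compare(ref, ans):
    """Visit the left child, then the right child, then the parent, returning
    (matched_steps, total_steps) so every step is graded, not just the result."""
    if ref is None:
        return 0, 0
    left_m, left_t = compare(ref.left, ans.left if ans else None)
    right_m, right_t = compare(ref.right, ans.right if ans else None)
    matched = left_m + right_m + (1 if ans is not None and ans.value == ref.value else 0)
    return matched, left_t + right_t + 1

# Reference tree for "3 + 4 x 2" versus a student who wrote "-" instead of "+".
ref = Node("+", Node("3"), Node("x", Node("4"), Node("2")))
ans = Node("-", Node("3"), Node("x", Node("4"), Node("2")))
print(compare(ref, ans))  # → (4, 5): four of the five steps match
```

Scoring matched steps over total steps gives partial credit for a correct process with a wrong final operation, which is the stage-by-stage assessment described above.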
Graph Model and Deep Learning for Topic Labels in Classifying Short Texts of Scientific Article Titles (Surya, Prof)
This work could be applied in a cold-start situation where a research management unit needs to record researchers and their expertise as assets. To avoid the hassle of requesting lists of published scientific articles from each researcher, our framework starts with data collection, i.e., scraping publicly available article metadata of selected researchers from Google Scholar. This work investigated researchers who tend to be actively involved in research activities; such researchers are mostly inclined to publish scientific articles and apply for national research grants. The selected researchers are affiliated with top state universities, in our case those with legal-entity status from the Indonesian ministry of research, technology and higher education.
Our dataset has 3,900 researchers included in the recipient lists of national research grants between 2018 and 2020. Researcher records can be catalogued at the national level, as in the Slovenian scientific system [13], which makes top researchers known in order to promote a productive research environment.
Thus, there are an additional 500 researchers from the Indonesian scientific system called the science and technology index. The per-topic counts of article titles in the two periods, together with sample keywords, are listed in Table 1, e.g., organic chemistry (5,729, then 6,821), chemistry (4,488, then 10,318), agriculture (3,999, then 4,868), microbiology (3,899, then 6,684), livestock (3,239, then 6,541), nutrition-biotechnology (3,230, then 7,850), education (5,112, then 8,625), natural resources (3,371 or 16.3%, then 10,294 or 22.1%), and science education (2,721, then 11,551), with sample keywords such as synthesis, acid, carbon, material characteristic, patient, risk, factor, child, society, control, optimization, regency, management, student, improve. The article titles published between 2018 and 2020 by those researchers are in English or Indonesian words, since researchers may publish in national journals or at international conferences. Thus, the language of each short text was identified with the LangDetect library before text pre-processing.
Several typical text pre-processing steps were applied to the short texts of the collected article titles, such as alphanumeric transformation and lower-casing before tokenization and stop-word removal. Stemming (the Sastrawi library) and lemmatization (the NLTK library) are used to transform words into their base forms and reduce the size of the vocabulary space. Since the texts have no topic labels, several topic modelling methods [14] were examined, such as LDA and its adjustment LDA Mallet, latent semantic analysis (LSA), and the hierarchical Dirichlet process (HDP), which infers the topic number from the data.
Topic modelling tests with the biterm topic model (BTM), which considers pairs of words occurring within the same context window, were also examined. However, its memory requirements make BTM less preferable.
Various term weighting schemes that measure term importance for keywords were also examined, such as the typical term frequency-inverse document frequency (TF-IDF) and the LogEntropy model, which normalizes gaps in TF values. We also investigated the common embedding GloVe to ensure that the vector representations of texts are similarly distributed in an n-dimensional space. We examined the results using topic coherence as a performance indicator for all combinations of topic modelling (with topic numbers ranging from 10 to 90) and term weighting schemes. After comparing topic coherence against processing time, our examination recommended LDA with TF-IDF, with 18 as the topic number and 0.6 as the topic coherence value.
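For reference, the classic TF-IDF weight is tf · log(N/df). A minimal sketch is below; a real pipeline would typically use a library implementation such as gensim's TfidfModel or scikit-learn's TfidfVectorizer, which also apply smoothing and normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["topic", "model"], ["topic", "label"], ["graph", "label"]]
weights = tfidf(docs)
print(round(weights[0]["model"], 3))  # "model" is rare, so it gets a high weight
```

Terms appearing in every document get weight log(1) = 0, which is why frequent but uninformative words stop acting as keywords.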
LDA gives a "topic membership", i.e., the probability that a text belongs to a topic context, such that LDA allows multiple labels per text. Since article titles have a limited number of words, usually around 15-20, and even fewer after stop-word removal, it is not easy to associate a text with a topic. Thus, as part of word filtration to reduce the vocabulary space and to filter poorly represented article titles from the dataset, our framework assigns a single topic label from the highest probability value of the LDA result, with a minimum threshold of 0.063. This word filtration decreases the training data from ±180K to ±109K titles. The 18 topic labels, listed in Table 1, were manually defined after observing some frequent words. In this manuscript, the sample keywords have been translated into English terms.
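The single-label rule above can be sketched as follows. The function name and the (topic, probability) input format are illustrative assumptions; the 0.063 threshold is the paper's.

```python
def single_label(topic_probs, threshold=0.063):
    """Assign one topic label from an LDA membership vector: keep the topic
    with the highest probability, or drop the title (return None) when even
    the best probability falls below the threshold."""
    topic, prob = max(topic_probs, key=lambda tp: tp[1])
    return topic if prob >= threshold else None

print(single_label([(3, 0.41), (7, 0.12), (11, 0.05)]))  # → 3
print(single_label([(3, 0.060), (7, 0.050)]))            # → None
```

Titles that return None are the poorly represented ones removed during word filtration, which is how the training set shrinks from ±180K to ±109K titles.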
Some topics have related contexts, such that Table 1 shows eight topic groups. The growth (0.71) in the number of article titles between those two periods demonstrates the increasing output of researchers. Thus, a system to manage researchers' data could help profile them and could be used for recommendations or visualizations, a valuable tool for an institution to manage, promote, and direct more funding to specific research topics.
Selecting keywords from texts is an important task in analysing context to understand content. Word embedding with GloVe was also observed in our empirical analysis; the embedding gave better results, as described in detail in the results and discussion.
Fuzzy C-Means and Social Network Analysis Combination for Better Understanding the Patient-based Spread of
Dengue Fever with Climate and Geographic Factors (Wiwik, Prof)
The data used in this investigation include information about DF patients and weather information such as air temperature, humidity, rainfall, and wind speed. Furthermore, geographic data in the form of each area's elevation are included. Dengue patient data, a daily report from January 2018 to December 2019, were obtained from the Malang district health office. The altitude data are the height above sea level of each sub-district, collected from the central bureau of statistics. Data on air temperature, humidity, rainfall, and wind speed were obtained from the meteorology, climatology, and geophysics agency of Karangploso. All data cover the same period, January 2018 to December 2019.
PREDICTION OF OSTEOARTHRITIS USING LINEAR VECTOR QUANTIZATION BASED TEXTURE FEATURE (Lilik, Prof)
This study used knee x-ray images obtained from the Osteoarthritis Initiative (OAI). The x-rays were acquired with the fixed-flexion PA view (see Figure 1). Figure 2 portrays examples of the data processed in this study, from KL-Grade 0 to KL-Grade 4. The data were further divided into two categories: learning data and testing data. The learning category contained five images for each KL-Grade, and 499 images were used for the testing category.
Figure 2. Joint Space Area (JSA) data: (a) grade 0, (b) grade 1, (c) grade 2, (d) grade 3, (e) grade 4