International Journal of Engineering and Advanced Technology (IJEAT)
ISSN: 2249-8958 (Online), Volume-10 Issue-1, October 2020
Automatic Pre-Processing of Marathi Text for
Summarization
Apurva D. Dhawale, Sonali B. Kulkarni, Vaishali M. Kumbhakarna
Abstract: Text summarization is a technique in which a large original text is condensed into a smaller version without changing its meaning. Text summarization is typically done for the common foreign and regional languages, but little work has been observed for the Marathi language. As the amount of e-content on the web is increasing drastically, users are facing difficulty in reading newspaper articles and in extracting and sorting their different perspectives. We are focussing on educational, political and sports news for summarization, which will be helpful for students who are appearing for competitive exams. This paper explores the pre-processing techniques for Marathi e-news articles.
Keywords: Text summarization, POS tagging, Pre-processing, LDA (Latent Dirichlet Allocation), LNS (Label Induction Grouping), SVM (Support Vector Machine)

Revised Manuscript Received on October 10, 2020.
* Correspondence Author
Ms. Apurva D. Dhawale*, Department of Computer Science, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India.
Dr. Sonali B. Kulkarni, Completed her Master of Science, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India.
Ms. Vaishali M. Kumbhakarna, Completed Master of Science, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India.
© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Retrieval Number: 100.1/ijeat.A18031010120
DOI: 10.35940/ijeat.A1803.1010120
Journal Website: www.ijeat.org
Published By: Blue Eyes Intelligence Engineering and Sciences Publication
© Copyright: All rights reserved.

I. INTRODUCTION
Summarization is defined as extracting the features of a text document and generating an abstract with the same meaning. [1] To have access to reliable and accurate data, a user needs a very potent system that gives the best results. The summarization of text is an interesting area on which people of the 21st century will rely for saving time, for accuracy, and for reducing the effort of reading a whole document. There are many prominent languages for which work has been done in the area of text summarization, but today the need for regional language text summarization is pressing. Keeping this in mind, the work for regional languages in Maharashtra has been reviewed, where the Marathi language has received less attention. The literature for Marathi text summarization shows that there is no known powerful tool or system which gives high efficiency in summarizing Marathi text, so it is necessary to focus on Marathi text summarization. There are two major steps through which the text goes to produce an efficient output: a) Pre-processing and b) Processing. [3]

II. LITERATURE STUDY
To find appropriate information, a user needs to search through entire documents. This causes an information overload problem, which leads to wasted time and effort. It happens when a user queries for information on the internet: he may get thousands of result documents which are not necessarily relevant to his concern. To deal with this dilemma, automatic text summarization plays a vital role. Automatic summarization condenses a source document into meaningful content which reflects the main thought of the document without altering the information [13]. There are distinct automatic text summarization systems for most of the regularly used natural languages. [4] Text summarization methods can be categorized by the way summarization is done. The approaches mainly include single document, multi document, monolingual, multilingual, generic, query based, indicative and informative summaries. [14] These methods are used for numerous foreign and Indian languages all over the world. As we are focussing on the Marathi language, which is the regional language of Maharashtra, the following work done in recent years is relevant.
Mr. Shubham Bhosale, Ms. Diksha Joshi, Ms. Vrushali Bhise, Prof. Rushali A. Deshmukh [1] proposed a system for Marathi newspaper text summarization using a ranking algorithm, which gives a summary that is on average 30% to 40% of the size of the original article. Anishka Chaudhari, Akash Dole, Deepali Kadam proposed a system which translates a Marathi dataset to English using the Google Translate API and then summarizes news articles using a bi-directional encoder-decoder LSTM model. The resultant summary is translated back to Marathi using the Google Translate API. [5] Pooja Bolaj, Sharvari Govilkar [2] developed a text classification system for Marathi documents using supervised learning methods and an ontology based classification technique, which classifies Marathi documents belonging to the Festival class, i.e. Diwali. Deepali K. Gaikwad, Deepali Sawane and C. Namrata Mahender developed a system for rule based question generation for Marathi text summarization using a rule based stemmer. The paper shows a technique which is used for generating the appropriate question for a given input text. [6] Yogeshwari V. Rathod [7] used a sentence ranking algorithm to generate summaries of Marathi news articles by the extractive method. It gives an effective summary in less time and with the least redundancy. Shraddha A. Narhari, Rajashree Shedge [8] proposed text categorization of Marathi documents using the LINGO and PCA algorithms, and reported improved results. Jaydeep Jalindar Patil, Prof. Nagaraju Bogiri [9] used the LINGO [Label Induction Grouping] algorithm to improve results for Marathi text documents. Prakhar Sethi, Sameer Sonawane, Saumitra Khanwalker, R. B. Keskar [10] developed a system to overcome the limitations of the lexical chain approach and generate a good summary using the WordNet thesaurus and pronoun resolution for news articles. N. Dangre, A. Bodke, A. Date, S. Rungta, S.S. Pathak [11] proposed a system for Marathi news clustering, using a clustering algorithm to collect relevant Marathi news from multiple sources on the web, which
results in enabling rich exploration of Marathi content on the web. Mamatha Balipa, Dr. Balasubramani R, Harolin Vaz, Christina Shilpa Jathanna attempted summarizing information from online health care forums about the disease Psoriasis to implement automatic text summarization. Online text is extracted using the BeautifulSoup class together with the urllib2 module. Then the topic of the text is confirmed to be Psoriasis by using the Latent Dirichlet Allocation (LDA) algorithm. [20] Chirantana Mallick, Ajit Kumar Das, Madhurima Dutta, Asit Kumar Das and Apurba Sarkar proposed a method which constructs a graph with sentences as the nodes and the similarity between two sentences as the weight of the edge between them. [21] Reda Elbarougy, Gamal Behery, Akram El Khatib applied a modified PageRank algorithm in which the initial score of each node is the number of nouns in the sentence. More nouns in a sentence mean more information, so the noun count is used as the initial rank of the sentence. Edges between sentences are weighted by the cosine similarity between the sentences, so that the final summary contains sentences that carry more information and are well connected with each other. [22] Ahmed Elrefaiy, Ahmed Rafat Abas, Ibrahim Elhenawy provided a review, in the form of a collaborative survey, which focuses on unsupervised techniques. It also describes techniques for the evaluation of summaries. [23] Rasim Alguliev, Ramiz Aliguliyev presented an approach which can improve performance compared to state-of-the-art summarization approaches. They proposed new criterion functions for sentence clustering and developed a modified discrete differential evolution algorithm to optimize the objective functions. [24] Kalliath Abdul Rasheed Issam, Shivam Patel, Subalalitha C. N. proposed a technique which aims to capture all the varied information present in the source documents. They also found that their model produces encouraging ROUGE results and summaries when compared to other published extractive and abstractive text summarization models. [25] Siddhant Upasani, Noorul Amin, Sahil Damania, Ayush Jadhav, A. M. Jagtap obtained the rank or score of each sentence; the sentences with a rank above a particular value can be chosen for inclusion in the summary. [26] Yash Asawa, Vignesh Balaji, Ishan Isaac Dey surveyed numerous approaches, along with the merits and limitations of the summarization techniques. The benchmark datasets of this domain and their features have also been examined. [27]

III. PROPOSED SYSTEM
There are multiple types of text summarization, including bilingual, multilingual, single document and multi document text summarization, where the categories can be: 1] Foreign language and 2] Indian language. The literature survey in the paper shows that foreign language text summarization is done using sentence ranking, deep learning, word frequency and distribution, fuzzy inference systems, rule based methods, genetic algorithms, LDA (Latent Dirichlet Allocation), Random Indexing and PageRank algorithms. Indian language text summarization is done using scoring of sentences, the ROUGE evaluation toolkit, sub graphs, Language-Neutral Syntax (LNS), Support Vector Machine (SVM) classifiers, hybrid algorithms and Bernoulli Model of Randomness algorithms. [12] Here we are focussing on Marathi text processing, which can be done using several algorithms: Text ranking, LINGO, Supervised Learning Methods, Clustering, lexical chains and domain specific summarization algorithms. [12] Sheetal Shimpikar, Sharvari Govilkar worked on an approach which takes Marathi documents as input text. The first step is pre-processing of the input text, after which a rich semantic graph method is used. They reported that the Rich Semantic Graph based method gives a correct, bug free result. [16]
In a nation like India there are 22 languages spoken, which are written in 13 different scripts, with about 720 dialects. Taking this into consideration, developing a nation-wide summarization tool for India would be a very difficult problem. Jovi D’silva, Dr. Uzzal Sharma examined approaches to this problem and also highlighted some existing research that has been done in Indian languages. They argued that a language independent approach to text summarization can prove to be enormously constructive, as the algorithm would have the potential to create summaries irrespective of the language of the input text. [17] Poonam Kolhe, Prof. Ashish Kumbhare designed an algorithm that can recognize the action word by abstraction and summarize the input document by extraction, attempting to modify this extraction using NLP tools like WordNet. [18] Umakant Dakulge, S. C. Dharmadhikari proposed a framework which summarizes a single document using the extraction method. TF-ISF, sentence length, sentence positional value and SOV verification are used to make the summary more relevant and precise. [19] In this research, we use an extractive approach based on a text ranking algorithm, in which the document is read first, its length is calculated, and a summary is generated which gives the important sentences according to the requirement of the user. The relevant literature shows that there are many methods and algorithms suitable for text processing and text summarization, as digital text is gaining importance day by day. The result may vary depending on the language chosen and the selected algorithm.
Marathi is an Indo-Aryan language, spoken primarily by the people of Maharashtra. Marathi is morphologically rich, so the classification of text gets very difficult. [2] The steps below show the pre-processing of a Marathi news article using Python.

Input Text -> Calculate Length -> Tokenization [Split Text] -> Remove special symbols -> Count Frequency of words -> Forming Key-Value Pairs
Fig.1. Pre-processing of Marathi news article

A. INPUT TEXT
The first step of text processing is to input the text or paragraph for summarization. The input text may contain words,
sentences or paragraphs. The validity of the text is checked, and if there are some words or sentences which are not in the Marathi language, they are eliminated from the document before it is sent for further processing.

    mytext = """ ... """
    [Marathi news article text not recoverable; the surviving fragments mention cbse.nic.in and cbseresults.nic.in]

B. PRE-PROCESSING
In Natural Language Processing (NLP), one of the important and traditional steps is to pre-process the input text. It transforms the text into a more comprehensible form with which machine learning algorithms work well. Basically, the unstructured data is turned into structured data. If we do not apply pre-processing, the data would be very inconsistent and could not generate good analytics results. [15] Here we install Python libraries which work with NLP and information retrieval for our system. Python libraries are commonly used to improve the performance of the system. After inputting the text, its length is calculated using the ‘len’ function.

    # Length of text
    len(mytext)
    output: 607

The next step is tokenization, where the sentences are broken into tokens. The process of tokenization includes splitting the text, for which the split() function can be used, and then the list of all the words is forwarded to the next step.

    # Split the text on whitespace into a list of word tokens
    word_list = mytext.split()
    output: [list of Marathi word tokens, not recoverable]

The further step in pre-processing is to remove special characters or symbols in the tokenized document. These characters are searched for in the document, and for this we used the replace() function, which searches for the special characters first and replaces them with white spaces.

    # Start from the input text and replace each special character in turn,
    # so that every replacement is kept rather than only the last one
    Text = mytext
    for char in '“”"‘’~`,/?\'[]{}:;\\|!@#$%^&*()_-=+<>\n':
        Text = Text.replace(char, ' ')
    output: [Marathi text with special characters replaced by spaces, not recoverable]

We have to count the frequency of each word because of the irrelevant words, i.e. [Marathi examples not recoverable]. An empty dictionary is created for storing the counts; the get() function is used to calculate the frequency count, so that the counter holds the exact count of each word.

    d = {}  # maps each word to its frequency
    for word in word_list:
        d[word] = d.get(word, 0) + 1
    output: [dictionary of Marathi words and their counts, not recoverable]

The key-value pairs are then formed for the feature vector. This gives a list of words, each with its frequency count next to it, as shown in the following output; this step gives the feature vector for the input document.

    word_freq = []
    for key, value in d.items():
        # store (count, word) as a tuple; {value, key} would build an unordered set
        word_freq.append((value, key))
    Output: [list of count and word pairs, not recoverable]
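The pre-processing flow of Fig. 1 can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the sample sentence, the function names and the exact list of special characters (which here also includes the full stop and the Devanagari danda) are hypothetical. Special symbols are removed token by token, following the order of the steps in Fig. 1.

```python
# Assumed list of special characters; the full stop and the Devanagari danda
# are additions for this sketch and are not taken from the paper.
SPECIAL_CHARS = '“”"‘’~`,/?\'[]{}:;\\|!@#$%^&*()_-=+<>.।'

# Hypothetical sample text, not the article used in the paper.
SAMPLE_TEXT = "मराठी ही महाराष्ट्राची भाषा आहे. मराठी बातम्या, मराठी लेख!"


def tokenize(text):
    """Split the text on whitespace into word tokens."""
    return text.split()


def remove_special_symbols(tokens):
    """Replace special characters inside each token with spaces and re-split."""
    cleaned = []
    for token in tokens:
        for char in SPECIAL_CHARS:
            token = token.replace(char, " ")
        # A token may become empty or split into several words after cleaning.
        cleaned.extend(token.split())
    return cleaned


def count_frequency(tokens):
    """Count how often each word occurs, using dict.get with a default of 0."""
    freq = {}
    for word in tokens:
        freq[word] = freq.get(word, 0) + 1
    return freq


def key_value_pairs(freq):
    """Return (word, count) pairs, most frequent first."""
    return sorted(freq.items(), key=lambda pair: pair[1], reverse=True)


def preprocess(text):
    """Run the steps of Fig. 1: length, tokenization, cleaning, counting, pairing."""
    tokens = remove_special_symbols(tokenize(text))
    freq = count_frequency(tokens)
    return {"length": len(text), "tokens": tokens, "pairs": key_value_pairs(freq)}


result = preprocess(SAMPLE_TEXT)
print(result["length"], result["pairs"][:3])
```

Sorting the pairs by count is a design choice of this sketch; it makes the most frequent words, which are candidates for stop-word filtering or for scoring sentences, appear first in the feature vector.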
IV. CONCLUSION
Regional language e-content needs more attention in text summarization. This paper puts a spotlight on the regional language of Maharashtra, i.e. Marathi. The tools used for processing the Marathi text are effectual to a degree, because the efficacy changes depending on the language and the tools used for text summarization. The paper highlights the pre-processing flow through which the Marathi text goes before summarization. In the first step the input file is extracted, then the length of the text is calculated, tokenization is performed, the end of the sentence is calculated, special symbols are removed, the frequency count of each word is taken as a statistical value, and key-value pairs are formed for further processing. We are trying to develop a system which is comparatively more capable and efficient for summarizing Marathi e-news.

REFERENCES
1. Mr. Shubham Bhosale, Ms. Diksha Joshi, Ms. Vrushali Bhise, Prof. Rushali A. Deshmukh, “Marathi e-Newspaper Text Summarization Using Automatic Keyword Extraction Technique”, International Journal of Advance Engineering and Research Development, Volume 5, Issue 03, March 2018.
2. Pooja Bolaj, Sharvari Govilkar, “Text Classification for Marathi Documents using Supervised Learning Methods”, International Journal of Computer Applications (0975 – 8887), Volume 155 – No 8, December 2016.
3. Virat V. Giri, Dr. M.M. Math and Dr. U.P. Kulkarni, “A Survey of Automatic Text Summarization System for Different Regional Language in India”, Bonfring International Journal of Software Engineering and Soft Computing, Vol. 6, Special Issue, October 2016.
4. Prof. Satish Kamble, Shivlila Mandage, Shubhangi Topale, Dipali Vagare, Prerana Babbar, “Survey on Summarization Techniques and Existing Work”, International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 12, Number 1 (2017).
5. Anishka Chaudhari, Akash Dole, Deepali Kadam, “Marathi text summarization using neural networks”, International Journal of Advance Research and Development, Volume 4, Issue 11, 2019.
6. Deepali K. Gaikwad, Deepali Sawane and C. Namrata Mahender, “Rule Based Question Generation for Marathi Text Summarization using Rule Based Stemmer”, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, PP 51-54, 2018.
7. Yogeshwari V. Rathod, “Extractive Text Summarization of Marathi News Articles”, International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, Volume: 05 Issue: 07, July 2018.
8. Shraddha A. Narhari, Rajashree Shedge, “Text Categorization of Marathi Documents using Modified LINGO”, IEEE, 2017.
9. Jaydeep Jalindar Patil, Prof. Nagaraju Bogiri, “Automatic Text Categorization-Marathi documents”, International Conference on Energy Systems and Applications (ICESA 2015), IEEE, 2015.
10. Prakhar Sethi, Sameer Sonawane, Saumitra Khanwalker, R. B. Keskar, “Automatic Text Summarization of News Articles”, International Conference on Big Data, IoT and Data Science (BID), Vishwakarma Institute of Technology, Pune, Dec 20-22, IEEE, 2017.
11. N. Dangre, A. Bodke, A. Date, S. Rungta, S.S. Pathak, “System for Marathi news clustering”, 2nd International conference on Intelligent computing, communication & convergence, Bhubaneshwar, ELSEVIER, 2016.
12. Apurva D. Dhawale, Sonali B. Kulkarni, Vaishali Kumbhakarna, “Survey of Progressive Era of Text Summarization for Indian and Foreign Languages Using Natural Language Processing”, ICIDCA 2019, LNDECT 46, pp. 654–662, Springer Nature Switzerland AG, 2020.
13. E. Lloret and M. Palomar, “Text summarization in progress: a literature review,” in Springer, no. April 2011, pp. 1–41, Springer, 2012.
14. Tarun B. Mirani and Sreela Sasi, “Two-level Text Summarization from Online News Sources with Sentiment Analysis”, International Conference on Networks & Advances in Computational Technologies (NetACT), 20-22 July 2017, Trivandrum, IEEE, 2017.
15. Vaishali Kalra, Dr. Rashmi Aggarwal, “Importance of Text Data Preprocessing & Implementation in RapidMiner”, Proceedings of the First International Conference on Information Technology and Knowledge Management, pp. 71–75, ICITKM, ISSN 2300-5963, ACSIS, Vol. 14, New Delhi, 2017.
16. Sheetal Shimpikar, Sharvari Govilkar, “Abstractive Text Summarization using Rich Semantic Graph for Marathi Sentence”, JASC: Journal of Applied Science and Computations, Volume V, Issue XII, ISSN NO: 1076-5131, December 2018.
17. Jovi D’silva, Dr. Uzzal Sharma, “Automatic Text Summarization Of Indian Languages: A Multilingual Problem”, Journal of Theoretical and Applied Information Technology, Vol. 97, No 11, 15th June 2019.
18. Poonam Kolhe, Prof. Ashish Kumbhare, “Optimizing Accuracy of Document Summarization Using Rule Mining”, International Journal of Computer Science and Mobile Computing, Vol. 6, Issue 6, pg. 207-216, June 2017.
19. Umakant Dakulge, S. C. Dharmadhikari, “Automated Text Summarization: A Case Study for Marathi Language”, Data Mining and Knowledge Engineering, CIIT, Vol 6, No 3 (2014).
20. Mamatha Balipa, Dr. Balasubramani R, Harolin Vaz, Christina Shilpa Jathanna, “Text Summarization For Psoriasis Of Text Extracted From Online Health Forums Using Textrank Algorithm”, International Journal Of Engineering & Technology, 7 (3.34) (2018) 872-873, 18 September 2018.
21. Chirantana Mallick, Ajit Kumar Das, Madhurima Dutta, Asit Kumar Das and Apurba Sarkar, “Graph-Based Text Summarization Using Modified Textrank”, J. Nayak et al. (Eds.), Soft Computing in Data Analytics, Advances in Intelligent Systems and Computing 758, Springer Nature Singapore Pte Ltd., 2019.
22. Reda Elbarougy, Gamal Behery, Akram El Khatib, “Extractive Arabic Text Summarization Using Modified Pagerank Algorithm”, Egyptian Informatics Journal 21, 73–81, Science Direct, Elsevier, (2020).
23. Ahmed Elrefaiy, Ahmed Rafat Abas, Ibrahim Elhenawy, “Review Of Recent Techniques For Extractive Text Summarization”, Journal Of Theoretical And Applied Information Technology, 15th December 2018, Vol. 96, No 23, ISSN: 1992-8645, JATIT & LLS, 2005.
24. Rasim Alguliev, Ramiz Aliguliyev, “Evolutionary Algorithm for Extractive Text Summarization”, Intelligent Information Management, 1, 128-138, Scientific Research, SciRes, 2009.
25. Kalliath Abdul Rasheed Issam, Shivam Patel, Subalalitha C. N., “Topic Modeling Based Extractive Text Summarization”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9 Issue-6, April 2020.
26. Siddhant Upasani, Noorul Amin, Sahil Damania, Ayush Jadhav, A. M. Jagtap, “Automatic Summary Generation using TextRank based Extractive Text Summarization Technique”, International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, Volume: 07 Issue: 05, May 2020.
27. Yash Asawa, Vignesh Balaji, Ishan Isaac Dey, “Modern Multi-Document Text Summarization Techniques”, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-9 Issue-1, May 2020.

AUTHORS PROFILE
Ms. Apurva D. Dhawale completed her M.Phil. in Computer Science in 2015 from Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India. Currently she is pursuing her Ph.D. in Computer Science from Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India. She has 9 years of teaching experience in Dr. G. Y. Pathrikar College of CS & IT, MGM University, Aurangabad and has published 9 papers in reputed international journals including Scopus, Elsevier, Springer. Her research interest areas are Natural Language Processing & Biometric Image Processing.
Dr. Sonali B Kulkarni completed her Master of Science from Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India, standing first in the order of merit in the year 2002. She has also completed a Ph.D. in Computer Science from Dr. BAM University, Aurangabad and is currently working as Assistant Professor in the Department of Computer Science and IT,