Blank cells indicate that the article made no statement regarding that category; otherwise, the table states whether and how each category was considered.
* SL: single lag, MA: moving average, DL: distributed lag, AR: autoregressive term
Study locations.
Region | Countries | Number of studies (n = 33) |
---|---|---|
Africa | Burundi, Ethiopia, Kenya, Niger, Malawi, Rwanda, Tanzania, Uganda, Zambia | 8 |
East Asia | China, Taiwan, Korea | 5 |
Southeast Asia | Thailand, Vietnam, Singapore | 6 |
South Asia | India, Bangladesh | 8 |
Central/South America | Peru, Puerto Rico, Brazil | 5 |
Oceania | Australia | 1 |
The outcome counts of the diseases of interest were mostly analyzed in time units of weeks or months (29 studies). Daily and yearly counts were less common, used in only 3 and 1 studies, respectively (Table 3).
Summary of modelling characteristics
Modelling characteristic | Number of studies (n = 33) |
---|---|
Unit of outcome data | |
Daily | 3 |
Weekly (including bi-weekly) | 13 |
Monthly (including bi-monthly) | 16 |
Yearly | 1 |
Regression models | |
GLM (Poisson, quasi-Poisson, negative binomial) | 28 |
GAM (Poisson, negative binomial) | 3 |
Mixed models | 2 |
Control of seasonality and long term trend | |
Some adjustments were included in the model | 25 |
No adjustments / not described | 8 |
Autocorrelation | |
Examined / included parameters to control autocorrelation | 21 |
No specific measures / not described | 12 |
Lag effects of exposure | |
Lag effects of weather variables were assessed | 28
No lag effect assessments | 5 |
As specified in the review criteria, the regression models were GLMs and GAMs with different distributional assumptions, i.e. Poisson, quasi-Poisson, and negative binomial (31 studies). The other two studies used mixed models. Among the studies, 18 used models that allow for overdispersion, where present, either by including an overdispersion parameter or by choosing a distribution that accommodates it (e.g. quasi-Poisson or negative binomial).
As mentioned above, adjustment for seasonal variation and long-term trend is part of the standard approach in typical time-series regression. In our review, 25 of the 33 studies (76%) included model terms that allow for seasonality and trend, using natural spline functions of time, trigonometric functions, or month and year indicator variables. Beyond adjustments for cyclic seasonality and long-term trend, more than half of the reviewed studies also described considerations or attempts to control autocorrelation (21 studies). Such adjustments may be necessary because observations of a time series that are close together in time are generally serially correlated, producing high autocorrelation. In those 21 studies, the most common method of controlling autocorrelation was to incorporate autoregressive terms, including lagged outcome values, the logarithm of lagged outcome values, and lagged model residuals (19 studies).
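To make these modelling choices concrete, the following Python sketch (illustrative only, not taken from any reviewed study) fits a Poisson GLM of weekly case counts on a weather exposure, with a B-spline of time for the long-term trend, annual harmonic terms for seasonality, and the log of the lagged outcome as an autoregressive term; a Pearson-based scale estimate approximates quasi-Poisson behaviour. The variable names (`cases`, `temp`, `time`) and the simulated data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical weekly surveillance data: case counts and mean temperature.
df = pd.DataFrame({
    "cases": np.random.poisson(20, 260),      # placeholder outcome
    "temp": 15 + 10 * np.random.rand(260),    # placeholder exposure
})
df["time"] = np.arange(len(df))               # weekly index for the long-term trend
# Autoregressive term: log of the previous week's counts (add 1 to avoid log(0)).
df["log_lag_cases"] = np.log(df["cases"].shift(1) + 1)
df = df.dropna()

# Poisson GLM with a B-spline of time (trend), annual harmonics (seasonality,
# 52-week period), the exposure, and an autoregressive term.
model = smf.glm(
    "cases ~ bs(time, df=7)"
    " + np.sin(2*np.pi*time/52) + np.cos(2*np.pi*time/52)"
    " + temp + log_lag_cases",
    data=df,
    family=sm.families.Poisson(),
)
# Quasi-Poisson-like behaviour: scale the covariance by the Pearson chi-square
# estimate of the dispersion parameter.
result = model.fit(scale="X2")
print(result.summary())
```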
Other covariates were also included in many studies, such as spatial factors when studies involved different geographical areas, population size, risk-related indices, and holiday indicators. In assessing the risks of exposure factors, time lag effects were considered in the majority of the reviewed studies (28 studies). However, we found that the lag forms analyzed (i.e. single lag, moving average lag, or distributed lag) and the lag lengths varied across studies even for the same targeted disease. While the evaluated lag lengths were, when predetermined, often supported by literature reviews and biological plausibility, many studies did not provide a rationale for the lag lengths they assessed. In some exploratory studies, on the other hand, long lag lengths were investigated to capture the full exposure effects over time. Another finding of our review was that, even though infectious diseases generally confer temporary or permanent immunity, the susceptible or immune population was rarely addressed in the study models. No studies computed or integrated an estimate of the susceptible population, and a few studies instead included proxies (e.g. vaccination rate) to account for the target population's susceptibility.
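As a small illustration of the three lag forms mentioned above (single lag, moving average, distributed lag), the pandas sketch below constructs the corresponding exposure variables; the exposure name, lag lengths, and simulated data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly exposure series (e.g. rainfall); all names are illustrative.
df = pd.DataFrame({"rainfall": np.random.gamma(2.0, 10.0, 156)})

# Single lag (SL): exposure two weeks before the outcome week.
df["rain_sl2"] = df["rainfall"].shift(2)

# Moving average (MA): mean exposure over the current and preceding three weeks.
df["rain_ma0_3"] = df["rainfall"].rolling(window=4, min_periods=4).mean()

# Distributed lag (DL): one term per lag (0-8 weeks), to be estimated jointly in
# the regression model; more flexible bases (e.g. DLNMs) constrain these terms.
for k in range(9):
    df[f"rain_dl{k}"] = df["rainfall"].shift(k)
```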
While time series analysis with GLMs or GAMs is an established method in environmental epidemiology research, our review draws attention to several potential issues that arise when the traditional approach developed for non-infectious diseases is extended to infectious diseases.
First, immune protection, one of the unique features of infectious diseases, can lead to rapid changes in the underlying population at risk over the course of a study period, yet few studies addressed the susceptible or immune population in their models. Information on the immune population can be critical, as host immune competence (an intrinsic factor) and environmental (extrinsic) factors are both important contributors to seasonal disease activity [ 41 ]. In particular, the importance of the interplay of intrinsic and extrinsic factors is illustrated by one cholera study in which outbreaks failed to develop when the susceptible population was small, even under environmental conditions favorable to the disease [ 42 ]. The consequence of not taking the susceptible population into account in a model is misquantification of the effects of environmental exposures. However, since estimates of immune or susceptible individuals within a population seldom exist in data, it is often necessary to create alternative measures to increase the precision of the analysis. Alternative approaches may include, but are not limited to, reconstructing the susceptible population with deterministic models (e.g. susceptible-infected-recovered models) and using proxy indicators such as vaccination rates.
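As a purely illustrative sketch of the proxy idea, the following function reconstructs a susceptible series from reported case counts using simple susceptible-infected-recovered style bookkeeping. The reporting rate, the initial susceptible fraction, and the assumptions of permanent immunity and a closed population are hypothetical choices, not values from any reviewed study.

```python
import numpy as np

def reconstruct_susceptibles(cases, population, reporting_rate=0.25,
                             initial_susceptible_fraction=1.0):
    """Crude susceptible reconstruction from reported case counts.

    Assumes permanent immunity after infection, a constant reporting rate,
    and no births, deaths, or migration; deliberately simple.
    """
    susceptible = initial_susceptible_fraction * population
    series = []
    for reported in cases:
        new_infections = reported / reporting_rate   # scale up for under-reporting
        susceptible = max(susceptible - new_infections, 0.0)
        series.append(susceptible)
    return np.array(series)

# Hypothetical usage: weekly reported cases in a population of one million.
weekly_cases = np.random.poisson(200, 104)
s_t = reconstruct_susceptibles(weekly_cases, population=1_000_000)
# s_t (or its logarithm) could then enter the regression as an offset or covariate.
```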
Secondly, while adjustments for seasonal variation and long-term trend were common, one third of the reviewed articles did not include such adjustments in their models. The reason is unknown, but one possibility is a less apparent seasonal variation in disease activity. For instance, while temperate climate regions have regular influenza epidemics in winter, malaria often presents a less obvious periodic pattern of seasonality. In general, adjustment for seasonal variation in traditional time series analysis serves two important purposes: eliminating the effects of unknown time-varying covariates and helping satisfy the regression assumption of independence. The independence assumption is a particularly important underlying hypothesis for time series regression, because observations of a variable that are close in time tend to be similar and are generally correlated (i.e. autocorrelation) [ 1 ]. When seasonality appears absent from the outcome data at a glance, the question naturally arises whether seasonal adjustments need to be implemented in a model at all. However, given the serial correlations that may naturally exist in time series data, the question of whether to include seasonal adjustments should be examined carefully using statistical validations (e.g. model fit and residuals).
Another concern regarding autocorrelation arises when its strength and its potential underlying cause are considered. In our literature review, the inclusion of autoregressive terms in addition to seasonal adjustments to control autocorrelation was commonly observed (19 studies), which may imply that adjustment for seasonal variation alone is not sufficient. In general, imperfect control of autocorrelation suggests that other important time-varying covariates have been omitted from a model [ 43 ]. However, given the characteristics of infectious diseases, autocorrelation stronger than what seasonal adjustment can account for may be induced by genuine correlation between outcome observations due to disease transmission among individuals. In other words, true dependence among neighboring observations can be present in infectious disease data because the number of newly infected individuals depends on the number of previously infected individuals in the population. In fact, some studies [ 15 , 16 ] included autoregressive terms (e.g. a lagged outcome or the logarithm of the lagged outcome) to account for this dependency in infectious disease data. This correlation is also known as "true contagion" [ 44 ], and the resulting violation of the assumption of independence biases not the regression coefficients but the estimates of their standard errors [ 43 ]. Thus, the discussion again returns to the importance of implementing adequate seasonality adjustments with statistical validations, and to the need for additional measures if autocorrelation remains in the model residuals. To address competently the autocorrelation resulting from true contagion, or the transmissibility of infectious diseases, it may be worthwhile in future work to explore approaches that are not only statistically effective but also biologically compelling from the standpoint of disease mechanisms.
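One way to check whether residual autocorrelation remains after seasonal and autoregressive adjustment is sketched below, assuming `result` is the fitted GLM object from the earlier regression sketch; it applies the Ljung-Box test and inspects the residual autocorrelation function.

```python
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import acf

# Deviance residuals from the fitted GLM of the earlier sketch.
resid = result.resid_deviance

# Ljung-Box test: small p-values suggest autocorrelation remains and further
# adjustment (e.g. additional seasonal or autoregressive terms) may be needed.
print(acorr_ljungbox(resid, lags=[4, 8, 13, 26]))

# Sample autocorrelation function of the residuals for visual inspection.
print(acf(resid, nlags=26))
```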
Thirdly, in estimating the lag effects of exposure factors, the lag timings evaluated varied across studies even for the same targeted disease. This may be because the quantitative evidence needed to establish optimal lag timings remains elusive for most diseases, even where qualitatively convincing ideas exist. The difficulty of estimating optimal lag times may be especially severe for vector-borne diseases, in which the transmission mechanisms become highly complicated due to the intermediating effects of vectors that drive the strong seasonality of these diseases [ 45 ]; the lags can also be highly context-dependent. For instance, the association patterns and lags of rainfall effects on malaria vary widely by region and climatic conditions (e.g. whether the region is generally dry or has abundant rain) [ 46 ]. More importantly, time lags and association patterns can be more complicated for infectious diseases than for non-infectious diseases because the mechanism of disease manifestation (e.g. the incubation period) and the transmission dynamics of pathogenic microorganisms (e.g. bacteria, viruses, parasites, or fungi) play a critical role in the causal pathway. Therefore, an understanding of the biological mechanisms can be of great help in estimating lags and association patterns. If no reliable prior knowledge exists, or complicated transmission pathways are expected, strategic exploratory approaches are required to find the optimal estimates.
Lastly, most of the reviewed studies conducted their analyses using weekly or monthly data (including bi-weekly and bi-monthly). Unlike for non-infectious diseases, daily count outcomes were much less common. This concern applies only to certain infectious diseases, but it is worth noting that using a longer time unit of data may lead to underestimation of risk factors when the optimal time lags of exposure effects and the disease incubation period are short (e.g. monthly data are used for analysis when the exposure effect is expected at a one-week lag). Wherever possible, the most statistically robust and biologically plausible time unit of data should be selected for the analysis.
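A brief pandas sketch of the aggregation issue, with hypothetical daily counts: the same series can be analysed weekly or monthly, and the coarser the unit, the more a short-lag exposure effect is spread across adjacent periods.

```python
import numpy as np
import pandas as pd

# Hypothetical daily case counts over two years.
idx = pd.date_range("2020-01-01", periods=730, freq="D")
daily = pd.Series(np.random.poisson(5, len(idx)), index=idx)

weekly = daily.resample("W").sum()    # weekly counts
monthly = daily.resample("M").sum()   # monthly counts
# If the true exposure effect operates at a one-week lag, a monthly analysis
# spreads it across neighbouring months and may dilute the estimated risk.
```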
Our study has some limitations. The first is that, among all the diseases potentially linked to weather variability, only four diseases were selected for the review. As a result, we may have excluded studies that could have offered insightful analytical approaches. In view of our aim to characterize methodological trends, however, the selected diseases were probably sufficient, as they encompass different types of infectious diseases, including water-borne, vector-borne, and air-borne diseases. Another limitation is that GLMs and GAMs were the only models targeted, even though other methods, such as autoregressive integrated moving average models, can also fall into the category of time series regression. Those other time-series methods might provide solutions for some of the concerns raised here, but we believe the issues discussed above are common across approaches and deserve careful attention and awareness. In conclusion, careful implementation of time series regression analysis is required in the study of environmental determinants of infectious diseases. Further studies are needed to explore alternative models and to develop methods that will improve time series analysis.
We sincerely thank Ben Armstrong for his insights that formed the basis of this study.
None to declare.
Big data has a substantial role nowadays, and its importance has significantly increased over the last decade. Big data's biggest advantages are providing knowledge, supporting the decision-making process, and improving the use of resources, services, and infrastructures. The potential of big data increases when we apply it in real time, by providing real-time analysis, predictions, and forecasts, among many other applications. Our goal with this article is to provide a viewpoint on how to build a system capable of processing big data in real time, performing analysis, and applying algorithms. Such a system should be designed to handle vast amounts of data and provide valuable knowledge through analysis and algorithms. This article explores current approaches and how they can be used for real-time operations and predictions.
The concept of big data was mentioned for the first time in a paper published in 1997 [ 1 ]. The authors called the problem of dealing with large data sets, “the problem of big data”. These large data sets were characterized by not fitting in the main memory, making it challenging or even impossible to analyze and visualize them. Even 25 years later, most computers cannot load 100 GB to memory, let alone process it.
In the current era, data is produced at high rates and information plays a decisive role, yet most computers cannot process such vast amounts of data; thus, it became necessary to create new ways to process it. These factors were the main impetus for the emergence of big data technologies.
The first approach to dealing with big data sets was to divide them into smaller segments. However, even then, the segments could be very large in most cases. Besides, few computers were able to perform this type of processing. To tackle this issue, frameworks started to appear that deal with batches of data. Nevertheless, none of these approaches addresses one big problem: what can be done if the data set keeps growing, and data continues to be received over time? To answer this question, several frameworks that deal with data streams have appeared.
The main goals of using big data are: (1) predicting future events, and (2) gaining insights and discovering relationships, in multidimensional and large sample-sized datasets [ 2 ]. However, these goals bring challenges in terms of computation and methods.
Predicting future events is also known as forecasting. Forecasting tasks typically involve time series data. Processing and analyzing time series data in real time can be a game-changer for an organization. This article will focus on time series data. Three tasks stand out in the analysis and prediction of time series data: monitoring, forecasting, and anomaly detection. These tasks benefit from being executed in real time. Moreover, they can be applied in many contexts and use cases. Therefore, it is important to use a streaming framework to process data as it arrives.
Anomaly detection in data streams is beneficial and essential for organizations to detect problems before they grow to more significant proportions: for instance, to notice an intrusion before the intruder can steal or damage data. Another example is to detect unexpected traffic congestion and alert the responsible authorities. Therefore, anomaly prediction connected to time series data will also be dealt with in this article.
Using data streams in different contexts allows us to extract knowledge and make decisions in real-time (or near real-time). This article will explore how we can deal with big data, particularly, time series big data. This article will also analyse which algorithms can be applied to data to make forecasts and detect anomalies.
The main contributions of this work can be summarized as follows:
A comparative analysis of Stream Processing Engines (SPEs), including their characteristics and provenance, processing techniques, delivery of events, performance, and popularity.
A discussion on forecasting algorithms, including statistical and Machine Learning (ML) algorithms, and the advantages and disadvantages of using each type of algorithm.
A discussion on anomaly detection algorithms, the challenges of working with datasets containing anomalies, and the methods used to detect anomalies, such as statistical and ML approaches.
A comparative analysis of SPEs led us to conclude that Spark is the most popular framework; however, Flink is better for data-intensive applications, and Heron scales better. Forecasting and anomaly detection methods bring value to organizations: while forecasting can enable better management of resources, anomaly detection can mitigate and eliminate problems. Regarding the types of methods used, statistical methods are usually lighter and more explainable, while machine learning methods are better when there are complex hidden patterns. The most recently published papers show a preference for deep learning techniques.
Working with huge amounts of streaming time series data can be a challenging task. With this in mind, we want to guide the reader on how this can be achieved. We will focus on three key relevant aspects:
Stream processing frameworks: these frameworks make it possible to process huge amounts of data, perform analysis, and apply algorithms in real time.
Forecasting algorithms: these algorithms make it possible to predict future events. They are therefore essential for many organizations to make informed decisions, manage resources, and improve services, among other uses.
Anomaly detection algorithms: these algorithms make it possible to identify abnormal or unusual patterns, which can be early symptoms that something is wrong and deserves attention. They help improve security, quality, and efficiency.
Although the main focus of this work is the literature review on streaming frameworks, since we aim to work with time series data, we will also review the forecasting and anomaly detection algorithms; they play a crucial role in taking advantage of real-time processing capabilities. Therefore, with this survey, we aim to:
Identify the most relevant state-of-the-art regarding both data streams and algorithms.
Evaluate and compare different frameworks and methods to highlight each method or framework’s strengths, weaknesses, and limitations and when they should be applied.
Provide a guide for future research by identifying gaps in the current literature, areas that need further investigation, and other opportunities.
This subsection provides an overview of other related surveys presented in the literature. Table 1 summarizes the subjects mentioned in the works presented in this article, both surveys and research works. In this section we will address the survey articles.
This article presents a literature review on how to process huge amounts of time series that are continuously being produced over time and need to be processed in real-time. Therefore, in Table 1 , we consider papers regarding big data, stream processing, real-time processing, machine learning and deep learning, forecasting, and anomaly detection. In addition, we revised both surveys and research articles. Unfortunately, to the best of our knowledge, we did not find a paper analyzing all these topics. Nevertheless, we will compare our study with the most relevant works.
The most significant difference from [ 9 ] regarding big data streams is that its authors compared several tools, technologies, methods and techniques for data streams, whereas we focus more narrowly on data stream processing frameworks. The authors of [ 3 ] also discussed the concept of real-time in the context of data stream processing, while the authors of [ 10 ] performed only a brief comparison of stream processing frameworks, complemented by some practical evaluations; our survey, in contrast, presents a literature review. Similar to the work presented in [ 11 ], we also examine progress in big data-oriented stream data mining; however, we focus on time series problems, namely forecasting and anomaly detection.
The remainder of this article is organized as follows. " Big data stream processing frameworks " section is focused on big data and data stream processing frameworks. It starts by discussing the problem definition, followed by existing solutions, it presents the elaborations and a summary. This section characterizes big data and discusses its relationship with data streams, forecasting methods, and anomaly detection. We also present frameworks for processing data streams, compare them, and discuss some example cases where each one can be applied. Next, " Analysis and algorithms for streaming data " section discusses algorithms that can be applied in the context of big data, namely forecasting concepts and methods (" Time series forecasting " section) and anomaly detection strategies (" Anomaly detection " section). In this section, we focus on statistical, ML, and Deep Learning (DL) methods and their advantages and disadvantages. Each of these 2 sections presents a similar organization. Finally, " Conclusions and future research directions " section presents the conclusions and the challenges envisaged for future work, as well as some future research directions.
Problem definition.
The evolution of traditional systems to streaming systems brings new processing and analysis capabilities and challenges. Firstly, we are no longer limited to bounded data, since we can process bounded and unbounded data. We are no longer required to divide or process data into multiple steps. Usually, a single step is enough. Besides, we no longer have to wait long periods for data to be processed. As we receive data, we process and obtain results and insights.
Designing the architecture of an application is an important task that should be well thought out. Considering that the streaming processing is part of an entire system, as a first step in the deployment of this component, the system requirements should be analyzed and task prioritization shall be evaluated. Choosing a SPE is not different. Some of the desired requirements that might be considered for real-time data stream processing are:
Process large volumes of data;
Integrate data from multiple data sources;
Deal with data with different properties (multi-dimensional data, multiple entities, spatial-temporal dependencies);
Deal with bounded and unbounded data streams;
Deal with unsorted data, or delayed data;
Detect data anomalies;
Meet computational performance requirements (low latency, high throughput, high availability, high scalability).
As we stated before, the true value of big data comes from taking insights from the data and helping decision-makers. Therefore, efficient and precise algorithms implemented on scalable frameworks are needed to explore the data's potential. If we consider ML and DL in our analysis, we might add the model performance (error and training time) to the list. In the context of forecasting, metrics such as the Mean Squared Error (MSE) or the R²-score can be useful [ 38 ]. In the case of anomaly detection, we may choose a high-accuracy, high-precision, or even high-recall method [ 16 ]. Since explanations play a crucial role in decision-making, the explainability of the ML model should also be considered [ 78 ].
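As a minimal illustration, the metrics named above can be computed directly; the arrays below are placeholders rather than results from any evaluated system.

```python
import numpy as np

# Hypothetical forecasting results.
y_true = np.array([10.0, 12.0, 15.0, 14.0, 13.0])
y_pred = np.array([11.0, 11.5, 14.0, 15.0, 12.0])

mse = np.mean((y_true - y_pred) ** 2)                     # Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                                 # R²-score

# Hypothetical anomaly labels (1 = anomaly) and detector output.
labels = np.array([0, 0, 1, 0, 1, 0, 0, 1])
flags = np.array([0, 1, 1, 0, 0, 0, 0, 1])

tp = np.sum((flags == 1) & (labels == 1))
fp = np.sum((flags == 1) & (labels == 0))
fn = np.sum((flags == 0) & (labels == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(mse, r2, precision, recall)
```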
There are several SPEs. Each SPE provides different features and has different properties. Moreover, each one can be more or less adequate according to the application.
The concept of big data has evolved through the years. First, big data started being depicted as a massive amount of data that does not fit in the main memory and requires more sophisticated ways of processing and visualizing [ 1 ]. This definition remains true; however, it is incomplete, since it is always being updated due to the data explosion [ 18 ] that occurred during the last decades. Defining big data is not a simple task because of its complexity. Figure 1 summarizes big data characteristics, challenges and opportunities.
Big data taxonomy—information collected from [ 2 , 5 , 15 , 17 , 19 ]
As previously mentioned, this massive amount of data is characterized by massive sample size and high dimensionality [ 2 ]. Besides, data can arrive at high velocities and different flow rates [ 19 ]. Moreover, data can come from different sources [ 2 ], making it more complex. Data stream frameworks can receive data from multiple sources and process huge volumes of data, continuously arriving at high velocities. Several factors increase the complexity of dealing with big data, such as the variety of data that can be received [ 19 ]. For example, we can receive numerical values, text, images, sounds, video, or a combination of more than one type. In addition, our data can have a temporal component that brings additional complexity to the problem.
The maximum potential of big data is achieved when we trust the data and take advantage of it by analyzing it. Thus, we must identify inaccurate and uncertain data and deal with it [ 19 ]. In this context, the importance of anomaly detection methods is highlighted, especially the real-time detection of anomalies in data streams so that they can be mitigated as soon as they happen.
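As a toy example of real-time anomaly detection on a stream, the sketch below flags values whose rolling z-score exceeds a threshold; the window size, threshold, and simulated readings are arbitrary choices, and a production detector would be considerably more sophisticated.

```python
import math
import random
from collections import deque

def rolling_zscore_detector(stream, window=50, threshold=3.0):
    """Yield (value, is_anomaly) for each item of a possibly unbounded stream."""
    history = deque(maxlen=window)
    for x in stream:
        if len(history) >= 10:                         # wait for a minimal history
            mean = sum(history) / len(history)
            std = math.sqrt(sum((v - mean) ** 2 for v in history) / len(history))
            is_anomaly = std > 0 and abs(x - mean) / std > threshold
        else:
            is_anomaly = False
        history.append(x)
        yield x, is_anomaly

# Hypothetical usage: a simulated sensor stream with one injected spike.
readings = (random.gauss(20, 1) if i != 500 else 80.0 for i in range(1000))
flagged = [v for v, is_anomaly in rolling_zscore_detector(readings) if is_anomaly]
print(flagged)   # should contain roughly the injected spike near index 500
```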
Some of these characteristics bring statistical, computational, and visualization problems. For example, we can have algorithm instability, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors regarding statistical problems [ 2 ]. On the other hand, regarding computation problems, we have storage, scalability, and bottleneck problems [ 2 , 79 ]. Finally, visualization can be complex or even impossible when we have high-dimensional data.
Statistical problems can bring dangerous consequences, since they can lead to wrong statistical inferences or false scientific discoveries. For instance, an excellent example of a spurious correlation is the strong correlation (99.79%) between “US spending on science, space, and technology” and “Suicides by hanging, strangulation and suffocation” [ 80 ]. As we can understand, these two phenomena are unrelated. This is a well-known phenomenon in statistics, meaning that correlation does not imply causality. However, spurious correlations can go unnoticed depending on the context and the available knowledge.
To summarize, big data requires demanding computational resources, and its potential is unlocked through trust in the data and its analysis. Therefore, several streaming frameworks have emerged to process big amounts of data with low latency, high throughput, and high scalability. Furthermore, anomaly detection methods are essential for data streams [ 19 ], since streams can suffer security attacks, include malfunctioning devices, or encounter other unexpected events. These methods can also be executed in batch; however, when applied to real-time streaming data, they achieve their full potential. Besides, big data allows us to (1) forecast future events, and (2) gain insights and discover relationships in data [ 2 ], both being important tasks, especially for decision-makers.
Big data analysis, forecasting, and anomaly detection are achieved through statistical, machine learning, or deep learning methods. Note that deep learning is a subset of machine learning. Figure 2 depicts Google search trends over the years, by keyword. Big data, machine learning, and deep learning show a growing trend over the years. On the other hand, anomaly detection had a very soft increase. The search trend for forecasting decreases, reaching its peak in 2022; however, forecasting can also be expressed with other terms, such as prediction. Note that Google Trends does not allow complex queries.
Google research trends over time—data collected from [ 81 ]
We can apply big data to a vast amount of scientific fields. We will present examples of use cases and applications for analyzing time series data streams in real-time. We will also include some examples that benefit from forecasting or anomaly detection methods.
In finance and economics, monitoring the stock market, detecting fraud, and forecasting the performance of assets are highly relevant tasks. In [ 25 ], the authors used Artificial Neural Networks (ANNs) and data streams to forecast stock prices. Monika Arya et al. [ 21 ] proposed a real-time method to detect credit card fraud in data streams, using ANNs with ensemble trees.
Regarding health care and well-being, monitoring patients and having real-time processing capabilities can save lives. For instance, Leo Kobayashi et al. [ 82 ] created a patient monitor system using streams and multimodal data fusion. Their approach allowed them to analyse the data, conduct experiments and develop and apply algorithms. Another interesting application is to monitor and forecast the spread of infectious diseases. For instance, Ensheng Dong et al. [ 83 ] created an interactive dashboard to monitor COVID-19 using data streams.
We can also find works that benefit from using frameworks to process data streams in informatics and communications, such as monitor resource usage or detect security attacks. In [ 4 ], the authors propose an internet traffic monitoring system using streaming frameworks. And in [ 7 ], Liu et al. perform resource management and scheduling.
Other main areas with big data characteristics are smart cities and Industry 4.0. One significant advantage is that they allow the creation of living labs, providing a space for learning and innovation. We can find several works to monitor and improve urban mobility, monitor water consumption and detect water leaks [ 84 ], and forecast traffic flow [ 38 ], among many others. Leonhard Hennig et al. [ 23 ] built a system to extract mobility and industry events from data streams. Qinglong Dai et al. [ 13 ] used a data stream framework with customized changes to process data from smart grids. Also in the context of energy systems, Philsy Baban [ 24 ] processed and validated real-time streaming data. In [ 8 ], Sahal et al. discussed streaming frameworks and other tools to perform predictive maintenance for railway transportation and wind energy.
As can be observed, big data applications can be found in several different fields. Society can benefit greatly from big data; however, big data can also be dangerous. In this article, we will not explore the "dark side" of big data: for instance, it can serve for mass surveillance and persecution, or increase disparities among minorities. We hope that governments and institutions use big data for good. In this context, a new research area has emerged, "fair AI", whose main goal is to combat racism, sexism, and other types of discrimination against minorities [ 85 ].
We use the term “big data” to define huge amounts of data [ 1 ] and the term “stream” to express data continuously being created and arriving [ 86 ]. This data can come from different sources and have different formats; its processing is not always trivial, especially if it is required in real-time.
Big data applications can have five types of components: data sources, a messaging platform, a processing module, a storage mechanism, and a presentation module. The data sources can be, among others, Internet of Things (IoT) sensors and social networks. These sources of information usually come from users, devices or activity logs. The messaging platform is responsible for sending data between modules. The processing module can be a streaming processing framework to ensure real-time processing capabilities. The storage mechanism can be a database or a data warehouse. Processed data can be presented in different ways, such as a web application, a mobile application, and a technical report. Figure 3 depicts the components of big data applications.
Big data applications components
Fundamental concepts.
In " Problem definition " we mentioned application requirements that can restrict the choice of a SPE. Now, we will discuss fundamental concepts that make it possible to have different data-processing techniques.
We may consider three types of processing: batch-based, stream-based, or event-based [ 87 ]. Batch processing is characterized by processing bounded data streams with a beginning and an ending. On the contrary, stream processing is characterized by the processing of unbounded data streams that do not have a known end. Besides, the data processing is performed as data arrives. If our application requires that we generate alerts or triggers if our data meets some conditions, we have event-based processing.
Concerning the processing model, we also have three types: at most once, at least once, and exactly once. At-most-once processing does not guarantee that the data is processed or persisted; in case of failure, we may have to deal with missing data. Usually, applications that choose at-most-once processing are more concerned with latency than reliability. By contrast, at-least-once processing may process or persist duplicated data, but it guarantees that every record is processed or persisted at least once. Finally, exactly-once processing guarantees that each record is processed or persisted exactly once.
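A minimal sketch of how at-least-once delivery is often reconciled with correct results: process events idempotently, keyed on a unique event id, so that redelivered duplicates have no effect. The event structure and the in-memory dedup store are hypothetical; a real system would keep this state in durable storage.

```python
# Turning at-least-once delivery into exactly-once *effects* by deduplicating
# on a unique event id (hypothetical event structure).
processed_ids = set()      # in production this would be durable, shared storage
totals = {}

def handle(event):
    """Process an event idempotently; duplicate deliveries are ignored."""
    if event["id"] in processed_ids:
        return                              # already seen: skip the duplicate
    processed_ids.add(event["id"])
    totals[event["key"]] = totals.get(event["key"], 0) + event["value"]

# Duplicate delivery of event 2 does not change the result.
for e in [{"id": 1, "key": "a", "value": 5},
          {"id": 2, "key": "a", "value": 3},
          {"id": 2, "key": "a", "value": 3}]:
    handle(e)
print(totals)   # {'a': 8}
```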
Window mechanisms specify how to divide the stream in order to aggregate time series data. There are six main processing techniques [ 26 , 88 ]. The most basic mechanism is the single pass, in which each new sample is processed only once. The remaining windowing mechanisms can be defined as a function of time or of the number of events [ 27 ]. A sliding window is a window of fixed size that slides over the data stream [ 26 ]. Tumbling windows are non-overlapping sliding windows [ 88 ]. Session windows are similar to tumbling windows, but with a gap between windows [ 88 ]. In a landmark window, a sample is specified from which the window keeps growing [ 26 ]; this sample can be updated from time to time. At last, the damped window mechanism uses a fading scheme in which the most recent samples have a bigger weight and, as time goes by, samples lose their weight [ 26 ]. Figure 4 represents some of these window mechanisms.
Processing window mechanisms
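Independent of any particular SPE, the following sketch contrasts tumbling and sliding windows over a list of timestamped events; the timestamps, values, and window parameters are hypothetical.

```python
from collections import defaultdict

# Hypothetical timestamped events: (epoch_seconds, value).
events = [(0, 1), (3, 2), (7, 4), (12, 1), (14, 5), (21, 2)]

def tumbling_sums(events, size):
    """Non-overlapping windows of `size` seconds."""
    sums = defaultdict(int)
    for ts, value in events:
        start = (ts // size) * size
        sums[(start, start + size)] += value
    return dict(sums)

def sliding_sums(events, size, slide):
    """Overlapping windows of `size` seconds that advance every `slide` seconds."""
    sums = defaultdict(int)
    for ts, value in events:
        # Add the value to every window [start, start + size) that covers ts.
        start = max((((ts - size) // slide) + 1) * slide, 0)
        while start <= ts:
            sums[(start, start + size)] += value
            start += slide
    return dict(sums)

print(tumbling_sums(events, size=10))            # e.g. {(0, 10): 7, (10, 20): 6, (20, 30): 2}
print(sliding_sums(events, size=10, slide=5))    # overlapping 10-second windows every 5 seconds
```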
Regarding stream-based processing, its methods can be considered stateless or stateful. If the processing is stateless, then no state is preserved across records or windows; we can use stateless processing if, for example, we want to know how many people buy a specific game in each month independently. On the other hand, if state is retained, the processing is stateful; this can be useful to measure how many people buy the game over time in a cumulative manner.
As aforementioned, we will discuss and compare different SPEs. We selected six SPEs: Apache Spark, Apache Flink, Apache Storm, Apache Heron, Apache Samza, and Amazon Kinesis. Besides, we decided to include Apache Hadoop for historical reasons.
Hadoop was the first framework to appear for processing large datasets using the MapReduce programming model. Hadoop is very scalable, since it can run on a single machine, on a single cluster, or spread across several clusters on multiple machines. Moreover, Hadoop takes advantage of distributed storage to improve performance by shipping the code that processes the data instead of shipping the data [ 89 ]. Besides, Hadoop provides high availability and high throughput. However, it can have efficiency problems when dealing with small files.
The major drawback of Hadoop is that it does not support real-time stream processing. To deal with this problem, Apache Spark emerged. Spark is a framework for processing batch and streaming data, and it allows distributed processing. According to Matei Zaharia [ 90 ], the creator of Spark, Spark was designed to respond to three big problems of Hadoop:
Avoid iterative algorithms that make several passes through the data;
Allow real-time streaming;
Allow interactive queries.
Instead of MapReduce, Spark uses Resilient Distributed Datasets (RDDs) that are fault-tolerant and can be processed in parallel. Spark also provides scalability, and since its early releases, it has proved to outperform Hadoop [ 33 ]. Spark is helpful for data science related projects. Besides its main component, Spark provides several libraries for Exploratory Data Analysis (EDA), ML, graph analysis, stream processing and SQL analytics.
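As one concrete illustration of Spark's stream processing model, a minimal Structured Streaming sketch is shown below; it uses the built-in rate source (which generates timestamped rows) purely for demonstration, and in practice the source would typically be Kafka, Kinesis, or a socket, with a sink other than the console.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; it stands in here
# for a real source such as Kafka or Kinesis.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Event-time tumbling window aggregation: count rows per 10-second window.
counts = stream.groupBy(window(col("timestamp"), "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```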
Two years later, Apache Flink and Apache Storm were created. While Spark uses micro-batches for stream processing, Flink and Storm can perform stream processing natively. Flink can process both batch and streaming data. In Flink, streams can be processed with specific temporal requirements; for example, we may consider processing time or event time. In the case of event time, Flink can deal with delayed events. Besides, Flink provides watermark support, allowing a trade-off between latency and completeness of data. Storm and Flink are similar frameworks, generating some discussion regarding their differences [ 91 ], among which the following stand out:
Storm only allows stream processing;
They both can perform stream processing with low latency;
The API offered by Flink is more high-level and provides more functionalities;
They have different strategies to provide fault tolerance (Storm employs record-level acknowledgements while Flink uses a snapshot algorithm).
Storm is a good streaming framework; however, its ability to scale is not sufficient for more demanding applications. Besides this, debugging and managing Storm can be complex tasks. In this context, Apache Heron emerges as the successor of Storm. A paper published in 2015 [ 34 ] announced this transition at Twitter.
Apache Samza is a framework that provides real-time processing, event-based applications, and Extract, Transform and Load (ETL) capabilities. Samza provides several APIs and presents an architecture similar to Hadoop's, but instead of MapReduce it uses the Samza API, and it uses Kafka instead of the Hadoop Distributed File System (HDFS).
Finally, Amazon Kinesis is the only framework presented in this article that does not belong to the Apache Software Foundation. Kinesis is actually a set of four services rather than a single data stream framework; for simplicity, in this work we use Amazon Kinesis to refer to the Kinesis Data Streams framework. Kinesis can easily be integrated with Flink.
The processing frameworks present different properties, which makes it challenging to choose one without understanding the differences. Therefore, we should choose the framework that best suits our use case.
Firstly, we decided to look at the nature of each framework. Although several frameworks belong to the Apache ecosystem, most were not created by Apache; they were later integrated into the Apache family through The Apache Incubator. Table 2 summarizes the nature of each of them.
Table 3 contains information about the processing techniques available (batch or stream) and the delivery of events (at most once, at least once, exactly once). As we already mentioned, Hadoop only provides batch processing. Storm and Heron only provide stream processing. All other frameworks offer both batch and stream processing. However, Spark provides stream processing through micro-batches. Regarding the delivery of events, most frameworks guarantee that the events are processed exactly once or at least once. Heron offers three types of delivery, the two mentioned above and at most once. Besides, these frameworks provide drivers for several programming languages, the most popular are Python and Java.
Performance-wise, some experiments have been conducted to compare the different SPEs. Note that it is difficult to make a fair comparison due to the lack of experiments that contemplate all frameworks. Therefore, we started with a performance comparison of the frameworks based on the information available in the official documentation of each one, presented in Table 4 . One of the most important characteristics when choosing a framework is the ability to process information in real-time. However, there is no consensual definition of what real-time means. Gomes et al. [ 3 ] focused their study on this concept in the context of data streams and big data. According to the authors, there are different intents when discussing real-time. For example, real-time could mean an immediate response. Another possibility is the guarantee of low latency: some consider the time within which the system should answer, while others refer to the time within which the system must answer. For a fairer comparison, in this discussion we will focus on real-time as the property of having low latency.
Most of these frameworks present low latency, which is desirable when we are processing significant amounts of data and want to process them in real-time. Hadoop is the only one considered to have high latency. All frameworks present high throughput and high scalability; however, Hadoop only allows scaling vertically. Regarding fault tolerance, all frameworks provide mechanisms to deal with failures.
After this initial study, we looked for works that compare some of these frameworks in order to make an unbiased comparison. In 2015, Namiot et al. [ 10 ] made an introductory comparison of the properties of Storm, Spark, Samza, Apache Flume, Apache Kafka, Amazon Kinesis, and IBM InfoSphere.
Besides the noticeable differences between Hadoop and Spark, Pooja Choudhary et al. [ 28 ] conducted experiments to compare these two frameworks. They concluded that Spark uses more memory than Hadoop but needs less execution time. However, the authors of [ 35 ] mentioned that Spark might not be the best framework if our application requires low latency and high throughput.
The authors of [ 29 ] compared the performance of Spark, Flink, and Storm under saturation conditions (the maximum streaming load that the frameworks could support without delay). This comparison is insightful if we want to choose the best framework for a data-intensive application. Flink presented the highest saturation level, while Storm had the worst CPU usage. When failure recovery mechanisms are activated, Storm's performance decreases by 50%, while Flink's only decreases by 10%. Nevertheless, Spark can surpass Flink if we are not concerned with latency.
Inoubli et al. [ 12 ] performed experiments comparing Spark, Storm, Flink, and Samza. They observed that Spark achieved the worst processing rates of the four frameworks. Flink and Samza were more efficient, especially for larger messages. Flink's CPU usage was lower; however, it could outperform Storm if allowed to consume more CPU. Spark requires more RAM, performs fewer disk accesses, is slower at processing messages, and uses less bandwidth.
In 2019, in the context of a smart city, Hamid Nasiri et al. [ 30 ] evaluated three frameworks: Spark, Flink and Storm. They started by fixing the input rate and comparing the performance with two nodes versus eight nodes. With two nodes, Flink presented the lowest latency and the highest throughput, while Spark had the worst latency. With eight nodes, Flink delivered a similar performance with a slightly higher throughput; the improvements of Spark and Storm were more significant, but Flink was still the best. With eight nodes, Spark presented a throughput similar to Flink's and reached the highest throughput peaks. The authors then analyzed the impact of changing the input rate and the number of worker nodes. From their results, the performance of Flink is similar to Storm's, even when Storm uses no acknowledgements; the most significant difference is throughput, in which Flink is better than Storm. However, Storm seems to scale better, and with eight nodes Spark is the best of them all in terms of throughput. At last, they measured CPU and network utilization: Flink achieved the lowest CPU utilization and the highest network utilization, while Storm and Spark achieved similar results.
Kolajo et al. [ 9 ] compared 19 tools and technologies for data streaming; however, only half of them supported both batch and stream processing. In another work [ 31 ], from 2019, the authors compared the performance of five stream processing systems: Storm, Flink, Spark, Kafka Streams, and Hazelcast Jet. Storm had the best memory consumption and good stability, Flink presented the lowest latency, and Spark presented the highest throughput and good compatibility with ML libraries.
In 2020, LinkedIn published a post [ 92 ] describing improvements made to Samza. These improvements gave Samza higher throughput than Flink.
Later, in 2021, Krzysztof Wecel et al. [ 32 ] selected six frameworks but chose to focus their analysis on comparing Spark and Flink. They concluded that Spark is more memory efficient while Flink is more CPU efficient. The authors also mentioned that, while performing their experiment, they encountered a problem that led to delays in the implementation phase: missing detailed documentation. We were already aware of this problem, especially with Flink.
Heron brings an extensive set of advantages to users who want to transition from Storm to a more scalable framework. The API available for Heron is compatible with the one available for Storm. Heron requires fewer resources (less CPU usage) and provides performance improvements (more throughput and less latency). Currently, Heron is in the incubating phase at The Apache Incubator [ 93 ].
To understand the frameworks' popularity, we decided to perform two experiments using Scopus. Footnote 9 These experiments were performed on August 9th, 2022. In the first experiment, we try to understand the popularity of the different frameworks over the years. In the second experiment, we try to determine how many publications exist when we consider different criteria.
For the first experiment, we created three queries. The example below contains the queries for the Apache Hadoop framework. Similar queries were performed for the remaining frameworks.
apache w/ hadoop
TITLE-ABS-KEY (apache w/ hadoop)
TITLE-ABS-KEY (apache w/ hadoop) AND (LIMIT-TO (SUBJAREA,“COMP”) OR LIMIT-TO (SUBJAREA,“ENGI”))
Firstly, we perform a general search using only the framework's name. Secondly, we restrict the search to papers with the framework's name in the title, abstract or keywords. Lastly, we limit the subject area to papers published in computer science or engineering.
Figure 5 contains the results of the first query. We can see that Hadoop is the dominant framework in the first years. This happens because Hadoop is the oldest framework, and most of the others did not exist or did not belong to the Apache Software Foundation at the time. The most popular streaming framework is Spark. Following Spark, the popularity of Flink and Storm is similar. Finally, Heron, Samza and Kinesis are the least popular frameworks.
Data processing frameworks: Popularity over the years first query
Figure 6 presents the results of the second query. When we restrict the search to papers with the framework's name in the title, abstract or keywords, we can see that Spark is the dominant framework. This might indicate that most papers mentioning Hadoop only do so because it was the first relevant framework. Another explanation is that Hadoop is the framework used in the study but is not the subject of the study. Therefore, this second query is more focused on papers that study the framework rather than merely use it.
Data processing frameworks: Popularity over the years second query
The pattern visible in Fig. 6 is intensified in Fig. 7 when we limit the subject area. Figure 7 shows the results of the third query.
Data processing frameworks: Popularity over the years third query
In the second experiment, we evaluate the number of papers that considered stream-related concepts and algorithms. Our goal is to understand, for instance, how many articles that addressed forecasting also addressed streams. We started with two basic queries. First, query 4 helps to understand how many papers contain the word forecast or words derived from it, such as forecasting or forecasts. Query 5 helps to understand how many papers include anomaly detection or outlier detection. Query 6 is an additional query to understand how many papers also include ML or DL.
(anomaly w/ detection) OR (outlier w/ detection)
(machine w/ learning) OR (deep w/ learning)
Figure 8 contains the results for forecasting terms. We started by performing query 4, which we named the forecast-term. Then, we also included query 6, which we called the ML-term, and selected only the papers that had both terms in the title, abstract, or keywords. The next step was to limit by subject area (as in the first experiment), and then we limited the search to the years from 2012 until 2023. Finally, we included different terms in order to answer our initial question, separating the term stream and the several frameworks. As we can see, we started with 1.5 million papers, and in the end, only 1 thousand had terms related to streams.
Forecast versus Stream
Figure 9 contains the results for anomaly detection. The only thing that changed relative to Fig. 8 was the initial term, which in this case was the anomaly detection term (query 5). As we can see, we began with 136 thousand papers, and in the end, only five hundred had terms related to streams.
Anomaly detection versus Stream
Only a few papers consider streaming and forecasting concepts because a forecasting algorithm, to provide the most benefits, should perform real-time forecasting. Moreover, given the complexity of implementing a stream-based forecasting system and a forecasting algorithm, researchers can be more focused on developing one of these tasks when they publish their work. The same can be applied to anomaly detection concepts and other applications.
Choosing the best SPE is a critical engineering task that should consider the following. Foremost, only Spark, Flink, Samza and Kinesis allow both batch and stream processing. In addition, Spark and Flink do not allow missing or repeated data, while Heron lets the user choose any of the delivery guarantees. Flink is the best framework for data-intensive applications, presenting the lowest latency and highest throughput; however, Storm seems to scale better. Recent studies have shown that Samza has better throughput than Flink and that Heron scales better than Storm. Nevertheless, Spark and Storm are the most popular stream frameworks. Heron is a good substitute for Storm, allowing Storm users to transition easily.
In the scope of ML, several tasks can take advantage of streaming technologies, such as regression, classification, clustering, forecasting, anomaly detection, and frequent pattern mining.
In this section, we decided to focus on two tasks related to time series: forecasting (" Time series forecasting " section) and anomaly detection (" Anomaly detection " section).
Humans are constantly trying to predict the future. Millions of years ago, when we started counting time, we also began to make predictions. One of the questions that most haunts humanity, and that several societies, religions and individuals have tried to guess, is when doomsday will occur. Several dates have been proposed over the years, but until now, none of them has been correct.
Forecasting is a prediction task in which we try to predict future events accurately. To make good forecasts, we should understand the phenomenon and the causes that influence the phenomenon. We can use historical data, events that may occur, and other information that may contribute to the forecasting task [ 94 ]. For example, when we look at the sky and see dark clouds, we can (most certainly) guess it will rain.
Depending on the domain of our problem, we should look for data beyond the phenomenon's own data. For instance, Wasiat Khan et al. [ 45 ] used data from social media and financial news to predict the stock market's performance. However, the authors recognize that not all stocks are influenced the same way: some stocks were more influenced by social media news, while others were more influenced by financial news. Ahmad Ali et al. [ 46 ] considered the spatial-temporal dependencies and several temporal patterns (current, daily, and weekly) to predict crowd flows. The use of external factors, such as weather conditions, holidays, and events, was also crucial in this context.
Forecasting tasks can be classified as short, medium or long-term forecasts [ 94 ]. These terms are used if the forecast is made for the near future, medium future or distant future. For instance, we may want to predict how many people will travel to a tourist destination in the next hour, in the next week, or in the next year.
Usually, short-term forecasting is only relevant within a short interval; therefore, we might benefit from performing it in real-time or near-real-time. On the other hand, medium and long-term forecasts are not needed immediately, so they can be performed offline.
Forecasting problems use time series data. A time series records the evolution of one or more variables over time; it is a time-indexed stochastic process, which makes its statistical properties relevant. When we only have one variable, we have a univariate time series; with more than one variable, we have a multivariate time series. Usually, when we are in the presence of a univariate time series, we simply call it a time series [ 94 , 95 , 96 ].
Forecasting methods
Time series data is similar to streaming data, since data arriving from a stream has a temporal component and a sequential order. However, this does not mean that all data from streams are time series, even though they might have an associated timestamp.
There are three types of forecasting methods: historical, statistical, and ML. Historical methods only look at past values to forecast new ones. The most popular historical method is the Historical Average (HA), which can be found in the literature [ 47 ], especially as a baseline. Statistical methods are mainly based on the Auto Regressive (AR) method and are also usually considered baselines; for instance, we can find the Auto Regressive Integrated Moving Average (ARIMA) in [ 47 ]. ML approaches, particularly DL, have been highlighted more recently, and several novel methods have been proposed.
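To give a flavour of the first two families, the sketch below computes an HA baseline and an ARIMA forecast on a toy series with statsmodels; the series values and the (1, 1, 1) order are arbitrary illustrative choices, not tuned for any of the reviewed problems.

```python
# Minimal sketch of the historical and statistical families on a toy series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

# Historical method: Historical Average (HA) baseline.
ha_forecast = y.mean()

# Statistical method: ARIMA with an arbitrary (1, 1, 1) order.
arima_forecast = ARIMA(y, order=(1, 1, 1)).fit().forecast(steps=3)

# ML/DL methods would instead be trained on lagged windows of y (e.g. an LSTM).
print(ha_forecast, arima_forecast)
```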
We can find forecasting works related to energy consumption and pricing. Bangzhu Zhu et al. [ 48 ] used an SVM-based method with mixture kernels to forecast carbon prices. Razak Olu-Ajayi et al. [ 49 ] predicted the energy consumption of buildings using ML and DL models and concluded that ANNs are the most suitable for making predictions. In [ 50 ], Zhang et al. proposed a Multi-view Ensemble Learning Model (MELM) to forecast the traffic of base stations in order to save power in cellular networks. Their multi-view method had four views: a temporal view, a spatial view, one dedicated to events, and a last view for residual information. For the temporal component, they analyzed the auto-correlation, trend, and seasonality of the data and used the Seasonal Auto Regressive Integrated Moving Average (SARIMA) to perform short and long-term forecasting. They used a spreading model based on a grid system to observe and capture the spatial dependencies, observing that different regions have different numbers of users and that mobility transfers from nearby regions. They used a decision tree to capture the influence of events, since these cause changes in traffic, considering four types of events (holidays, weather, concerts, and news). For the residual information, they used a top-k regression tree.
Another explored topic is related to traffic. To predict the flow of crowds, the authors of [ 51 ] proposed a framework called Forecasting Citywide Crowd Flows (FCCF). They used human mobility data, weather conditions, and road network data. First, they divided the human mobility data into two edge flow categories, inflow and outflow, and split the region into small regions. Then, they decomposed the flows into seasonal, trend, and residual components and built a model for each. For the seasonal and trend components, they created an Intrinsic Gaussian Markov Random Field (IGMRF) for each component. For the residual, they explored the spatiotemporal dependence and built a spatiotemporal residual model that uses a Bayesian network. Then, the models were aggregated to give the final prediction.
The authors of [ 52 ] proposed a multi-view network model called Deep Multi-View Spatial-Temporal Network (DMVST-NET). They observed that, in most cases, including a region that presents a weak correlation with the region we want to predict decreases the model's performance. Usually, distant regions are less correlated, but this is not always true. Considering all this, the authors chose to create three views: one for the temporal component, another for the spatial component (considering only nearby regions), and the last one for semantic relations (regions that are far away but present similar demands). They used a Long Short-Term Memory (LSTM) network for the temporal component, a Convolutional Neural Network (CNN) for the spatial component, and a Graph Neural Network (GNN) to capture the semantic relations.
In [ 53 ], the Multi-Task Learning Temporal Convolutional Neural Network (MTLTCNN) method is proposed for short-term passenger demand prediction. The authors started by using a Spatio-Temporal Dynamic Time Warping (ST-DTW) algorithm to select the most relevant features. The proposed method is multi-task, having one task per region; each task comprises a Temporal Convolutional Neural Network (TCNN), and the tasks share information between them, namely spatiotemporal correlations. Ahmad Ali et al. [ 46 ] proposed an ANN model based on graphs and convolution to predict crowd flows, also exploring spatiotemporal dependencies and external factors. The authors of [ 47 ] proposed an architecture that uses graphs, convolution, and recurrence to forecast traffic; their approach explores spatiotemporal dependencies.
In 2018, Spyros Makridakis et al. [ 39 ] published the results of the fourth edition of a forecasting accuracy competition. This competition discouraged the submission of complicated ML models that required high computational capabilities. Most of the best methods were combinations of statistical models; one of the best was a hybrid of ML (using a Recurrent Neural Network (RNN)) and a statistical approach (exponential smoothing). Some of the submitted methods were based only on ML and achieved the worst results. Later, in 2021, Spyros Makridakis et al. [ 40 ] published the results of the fifth edition of the competition, whose goal was to predict the sales of a retail company represented by 42,840 time series. Most of the competitors used methods based on LightGBM, a tree-based ML method. In the top five, the first two methods were essentially weighted combinations of LightGBM models, the third was a weighted combination of Neural Networks (NNs), the fourth was a non-recursive LightGBM, and the fifth a recursive LightGBM.
A literature review on deep learning methods for financial time series forecasting [ 43 ] presented eight commonly used methods: Deep Multi Layer Perceptrons (DMLPs), RNNs, LSTMs, CNNs, Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Autoencoders (AEs), and Deep Reinforcement Learning (DRL). The authors highlight researchers' preference for RNNs, especially LSTMs, with financial data. However, as the authors identified, CNNs and graph-based networks still need to be explored for financial data. Meanwhile, Masini et al. [ 44 ] reviewed both ML and DL methods for financial forecasting, focusing mainly on NNs, regression trees, bagging, and regression. The authors emphasized the use of ML models (including DL models) in the presence of large datasets.
Table 5 summarizes the reviewed works. In this comparison, we did not include the survey articles. As we can see, different approaches have emerged over the last years for both ML and DL methods. Most of the authors used more than one metric to compare the methods.
Figure 10 contains some of the methods used in forecasting tasks. Forecasting may be accomplished using statistical methods or DL-based methods, and both approaches have advantages and disadvantages; depending on the context, statistical methods may be more advantageous than DL methods and vice-versa. Statistical methods are explainable, usually more robust for short-term predictions, and present the best results in short-term contexts. However, they are usually not suitable for long-term forecasting.
ANNs present some disadvantages. The first problem is to find the weights of the inputs. The training process will update the model weights in each iteration; however, the optimization algorithm used may not lead to the minimum error or loss and can lead to overfitting. The training process can be extensive, making its adoption difficult in some contexts. ANNs also require a lot of information and great computational power when compared with statistical methods.
One of the big problems with ML algorithms is the lack of transparency, especially in ANNs. ANNs are often seen as “black boxes” [ 41 ]. In order to solve this issue, a new topic has emerged in the scope of ML: explainable models. Explainability plays a crucial role in the understanding of a particular problem. A correct prediction is not always enough, since it can have real impacts in terms of security, ethics, mismatched objectives, privacy, and others [ 42 ].
The most relevant advantage of DL-based methods is the possibility of working with multidimensional data, in some cases exploring the relationships between space, time, and other factors that may influence the prediction. For forecasting with real-time stream processing, statistical methods may be more beneficial, since they are computationally lighter. However, we should consider the application requirements, the data, and the trade-off between execution time and other performance metrics.
We decided to compare the types of methods used in forecasting in terms of popularity over the years, highlighting the most recent ones. Figure 11 shows the number of documents retrieved from Scopus when we perform queries such as example Q7. As we can observe, the use of machine learning and deep learning for forecasting has increased over the last few years.
Evolution of the popularity of type of methods regarding forecasting over the years. ML stands for Machine Learning, DL for Deep Learning, SL for Statistical Learning, and RL for Reinforcement Learning
TITLE-ABS-KEY ( forecasting AND ( “machine learning” OR “ml”) )
We also compared the individual methods used; Figure 12 contains the obtained results. Before 2018, the most mentioned type of method was the ANN. This can happen for two reasons: either the generic ANN architecture was used, or the authors used the term when referring to a specific type of ANN (for instance, an LSTM is a type of ANN). Over the years, we can observe an increase in the use of LSTMs, CNNs, RNNs, AEs, and GNNs. The popularity of deep learning methods does not mean that statistical ones are not important; it just reflects the evolution and trends of research methods.
Evolution of the popularity of methods regarding forecasting over the years. ANN stands for Artificial Neural Network, SVM for Support Vector Machine, LSTM for Long Short-Term Memory, A &S for ARIMA and SARIMA, RNN for Recurrent Neural Network, CNN for Convolution Neural Network, FNN for Feedforward Neural Network, AE for Autoencoder, GNN for Graph Neural Network, DBN for Deep Belief Network, LGBM for LightGBM, HA for Historical Average and RBM for Restricted Boltzmann Machines
Forecasting is an essential task when working with time series datasets. We can have different forecasting horizons, such as short, medium, and long-term. We can apply this type of method to different contexts and use cases.
Classical methods are mainly based on auto-regression. Among machine learning methods, LightGBM proved to be efficient. Among deep learning methods, the most used are based on LSTMs, CNNs, AEs, and GNNs. As discussed, all methods have their positive and negative aspects. In addition, the application and the intent of the problem can make the choice of technique easier.
An anomaly occurs when something unexpected happens. We can observe anomalies in our daily lives, for instance, a cold day (as if it were winter) in the middle of the summer. We can also visualize anomalies in data: if we look at a chart of the daily temperatures measured in the summer, we would see an anomalous point relative to the other points. However, not all anomalies are expressed in the same way. Anomalies can be classified by their nature: they can be point anomalies, contextual anomalies, or collective anomalies [ 54 ].
A point anomaly can be identified when we compare it with the rest of the data [ 55 ]. Remembering the “cold day in the middle of the summer” example, if we only had data from the summer, we would have a point anomaly if the observed temperature was very different from all others.
A contextual anomaly happens in a particular context [ 55 ]. If we had data from the entire year, we would observe that in the winter there are low temperatures. The point is anomalous because it happens in the summer and not in the winter. This is similar to a conditional anomaly, which depends on the context to be classified as an anomaly.
A collective anomaly is a collection of points that are considered anomalous when compared with the remaining dataset [ 56 ]. It can be, for instance, an abrupt change in the summer temperature. Another example would be a day with an unusually small variation in temperature: as we know, temperatures are higher in the summer, but they still fluctuate throughout the day. From the examples above, we can conclude that anomalies can also be present in time series, either as isolated outliers or as abrupt changes.
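As a toy illustration of point anomaly detection on the temperature example, the sketch below flags values whose z-score exceeds 2; the series and the threshold are arbitrary choices of ours.

```python
# Toy sketch: flag point anomalies in a daily temperature series with a z-score.
# The threshold of 2 standard deviations is an arbitrary illustrative choice.
import numpy as np

temps = np.array([29, 30, 31, 30, 12, 29, 31, 32, 30, 28], dtype=float)  # one cold day
z = (temps - temps.mean()) / temps.std()
anomalies = np.where(np.abs(z) > 2)[0]
print(anomalies)  # index of the "cold day in the middle of the summer"
```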
There are several challenges associated with the detection of anomalies. Anomalies are not always known or noticeable, and it is difficult to define what may be considered anomalous. Besides that, there is always some noise associated with anomaly detection. As an example, network attacks can change, evolve, and adapt, making this a complex problem, with negative impacts arising from false negatives and false positives in the analysis [ 54 , 57 ].
Anomalies are known for being rare in datasets; it is because of that property that they are considered anomalies. If a dataset contains anomalies and our goal is to identify them, we face a class imbalance problem. This problem is amplified when dealing with big data. There are three different techniques to address this issue [ 16 ]:
Data-based techniques: using sampling methods, we can reduce the level of imbalance;
Algorithm-based techniques: we can reduce the bias towards the majority group;
Hybrid techniques.
Some learners, such as decision trees and logistic regression, can have difficulty identifying anomalies, especially in highly imbalanced datasets [ 16 ]. Moreover, some classification metrics are more sensitive to imbalanced classes. Regarding evaluation, some metrics are highly affected and are not recommended, such as accuracy and error rate. Other metrics, such as precision and recall, can be used, but alone they are usually not enough [ 16 ]. The F-measure, a weighted harmonic mean of precision and recall, is widely used in this context.
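As a small sketch of these two points (algorithm-level imbalance handling and imbalance-aware metrics), the snippet below trains a class-weighted logistic regression on a synthetic 98/2 dataset and reports precision, recall, and F1 instead of accuracy; the dataset, proportions, and model choice are purely illustrative.

```python
# Sketch: algorithm-level handling of class imbalance plus imbalance-aware metrics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic dataset where only ~2% of samples belong to the "anomaly" class.
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reduces the bias towards the majority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Accuracy would look high even for a useless model; report precision/recall/F1 instead.
print(precision_score(y_te, pred), recall_score(y_te, pred), f1_score(y_te, pred))
```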
To detect anomalies, statistical learning approaches can be used. In [ 58 ], Hochenbaum et al. used seasonal decomposition to extract the trend and seasonal components. They proposed two techniques: the seasonal Extreme Studentized Deviate (ESD) and the seasonal hybrid ESD, which adds the median and the Median Absolute Deviation.
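The sketch below is not the seasonal hybrid ESD of [ 58 ]; it is only a rough illustration of the underlying idea: remove trend and seasonality first, then look for extreme values in the residual. The series, period, and threshold are invented for the example.

```python
# Rough illustration (not the seasonal hybrid ESD of [58]): decompose, then
# flag extreme residuals.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2022-01-01", periods=120, freq="D")
y = pd.Series(10 + 3 * np.sin(2 * np.pi * np.arange(120) / 7), index=idx)
y.iloc[60] += 8  # inject an anomaly

resid = seasonal_decompose(y, period=7).resid.dropna()
flags = resid[np.abs(resid - resid.median()) > 3 * resid.std()]
print(flags)  # the injected anomaly is flagged
```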
Some methods to detect anomalies are signal-based. In [ 59 ], the authors could effectively detect sharp increases in the local variance using wavelet filters and pseudo-spline filters. In [ 97 ], Muñoz et al. used correlation-based techniques.
Principal Component Analysis (PCA) based approaches were explored in [ 60 , 61 ]. In [ 60 ], the authors applied wavelet transformations to network traffic data, then applied PCA to extract the nature of the anomalies, and finally used a mapping function to detect them. In [ 62 ], the authors could also localize the source of anomalies by incorporating network structure information into the PCA model; they used the Karhunen Loève Expansion to capture spatial and temporal correlations. In [ 61 ], the authors proposed the use of the Minimum Covariance Determinant (MCD) with Robust Principal Component Analysis (rPCA). Since PCA can suffer from outliers being incorporated into the subspace, rPCA tackles this issue at a computational cost, and the use of MCD helps to reduce that cost.
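A generic PCA-based sketch (not the specific methods of [ 60 , 61 , 62 ]) follows: a principal subspace is fitted on mostly normal data, and new observations with a large reconstruction error are flagged as anomalous. Fitting on clean data side-steps the subspace-contamination issue that rPCA addresses; all data here is synthetic.

```python
# Generic PCA reconstruction-error sketch for anomaly detection (illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))          # assumed mostly normal observations
pca = PCA(n_components=3).fit(X_train)

X_new = rng.normal(size=(5, 10))
X_new[0] += 8                                  # one anomalous observation
err = np.linalg.norm(X_new - pca.inverse_transform(pca.transform(X_new)), axis=1)
print(err)  # the first reconstruction error is much larger than the others
```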
We can also find in the literature approaches based on the k-Nearest Neighbors (KNN) algorithm. In [ 63 ], the authors proposed a Transductive Confidence Machine (TCM) with KNN for online anomaly detection, improving their results by applying instance selection. The authors of [ 22 ] compared Naive Bayes, Support Vector Machines (SVMs), and decision trees, and Naive Bayes is used in [ 36 ].
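The snippet below is a generic distance-based sketch, not the TCM-KNN method of [ 63 ]: the distance to the k-th nearest neighbour is used as an anomaly score, with synthetic data and an arbitrary k.

```python
# Generic KNN-distance sketch: score each point by its distance to the k-th neighbour.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
X[0] = [6.0, 6.0]  # an isolated point

k = 5
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
scores = dist[:, k]          # column 0 is the point itself (distance 0)
print(np.argmax(scores))     # the isolated point gets the highest score
```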
Anomaly detection methods
Several works are based on ANNs, such as [ 37 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 ]. In [ 64 ], motivated by the presence of a high rate of false alarms and the goal of improving accuracy, Hussain et al. proposed a FeedForward Neural Network (FNN) to detect anomalies in cellular networks. They accomplished high accuracy and a low False Positive Rate (FPR), proving the usefulness of FNNs. The work in [ 65 ] used an LSTM to detect network attacks through the anomalies present in data. They tested two types of baselines: in the first, they only used clean data (without anomalies) to train the model; in the second, they used dirty data (with anomalies). They concluded that the dirty baseline models achieved the best results, which is good when no completely clean dataset exists. In [ 66 ], the Parallel Subagging-GRU-based network (PSB-GRU) method is proposed. The model uses a Gated Recurrent Unit (GRU) network for long-term dependencies, a genetic algorithm to optimize the training process, the Spark platform to improve training efficiency, and subagging to improve the model's generalization.
In [ 67 ], the performance of several RNN-based methods is compared. The authors concluded that LSTM networks achieve the best performance; however, the other RNN-based networks also achieved good results. The works in [ 65 , 66 , 67 ] allow us to conclude that sequential NNs are suitable for detecting anomalies. In [ 68 ], a CNN-based method is proposed to extract spatio-temporal and other features from data, with a threshold-based separation method to detect anomalies. The architecture had four convolutional layers. They achieved good results; however, they recognize that a more lightweight method is needed to perform online anomaly detection. The authors of [ 74 ] also used a CNN; in some cases, architectures with one convolutional layer achieved better performance than those with two or three convolutional layers. However, their methods did not outperform RNN-based methods. The authors of [ 69 ] explored how CNNs can fail, concluding that a one-pixel attack can mislead CNN-based networks; increasing the number of layers (three convolutional and three pooling layers) and retraining contributes to a more robust detection.
The authors of [ 70 ] proposed an ensemble method based on RBMs and SVMs, tested it in real time, and achieved good performance. The work in [ 71 ] used Self-Organizing Maps (SOMs); their model is computationally light, presenting results with very low delay. In [ 37 ], the authors also used SOMs together with k-medoids, performing two-step clustering; they achieved fast online detection and a multistage decision to distinguish different anomalies. In [ 72 ], an autoencoder-based method with convolution is proposed: the use of autoencoders allowed the authors to capture non-linear correlations between features, and the use of convolution also reduced the training time. In [ 73 ], stacked autoencoders are used with a one-class classification model; the autoencoders allow the selection of the most relevant features and the reduction of data dimensionality.
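A hedged sketch of the autoencoder idea (generic, not the architectures of [ 72 , 73 ]) is shown below: the model is trained to reconstruct normal data, and observations with a high reconstruction error are flagged as anomalous. The layer sizes and training settings are arbitrary.

```python
# Generic autoencoder-based anomaly detection sketch (illustrative only).
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(1000, 20)).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(20,)),  # bottleneck
    tf.keras.layers.Dense(20, activation="linear"),                  # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_normal, X_normal, epochs=10, batch_size=32, verbose=0)

X_new = rng.normal(size=(5, 20)).astype("float32")
X_new[0] += 5.0  # anomalous observation
errors = np.mean((X_new - autoencoder.predict(X_new, verbose=0)) ** 2, axis=1)
print(errors)  # the first reconstruction error stands out
```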
Other approaches, such as the ones proposed in [ 75 , 76 ], are tensor-based. A tensor is a structure similar to a multidimensional array with three or more dimensions; with one dimension we have a vector (a first-order tensor), and with two dimensions we have a matrix (a second-order tensor) [ 76 ]. In [ 75 ], the proposed method is based on tensor decomposition. The method in [ 76 ] is based on tensor factorization and performs anomaly detection in two phases. Tensor-based methods are useful when we have complex data of high order.
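For readers less familiar with the terminology, the snippet below simply illustrates tensor orders with NumPy; it is not a detection method, and the (time, source, destination) interpretation of the third-order example is our own.

```python
# Simple illustration of tensor orders with NumPy (not a detection method).
import numpy as np

vector = np.arange(4)                     # first-order tensor
matrix = np.arange(12).reshape(3, 4)      # second-order tensor
tensor = np.arange(24).reshape(2, 3, 4)   # third-order tensor, e.g. (time, source, destination)
print(vector.ndim, matrix.ndim, tensor.ndim)  # 1 2 3
```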
Table 6 summarizes the reviewed works on anomaly detection, covering different types of methods. In anomaly detection, one of the most important tasks is the fair evaluation of the methods: as mentioned above, anomaly detection problems usually suffer from class imbalance. To better compare the evaluation metrics used, we created Table 7 . False Positive Rate, True Positive Rate, and accuracy are the most frequently used metrics. Class imbalance strongly affects accuracy, so this metric should not be used, especially without other metrics.
Figure 13 contains some methods used in anomaly detection. Traditional statistical methods can fail in the face of big data and data with many dimensions. On the other hand, ML methods can deal with high dimensionality. Supervised methods achieve good performance in detecting anomalies [ 6 ]; however, they have problems detecting new, unseen types of anomalies. Unsupervised methods are good at detecting new anomalies [ 14 ].
Figure 14 contains the evolution of the popularity of the types of anomaly detection methods over the last few years. The use of statistical methods decreased while the use of deep learning methods increased; currently, most of the published works use machine learning and deep learning. Similarly, Fig. 15 contains the evolution of the popularity of individual techniques over the last few years. As we can observe, methods such as PCA, SVM, and KNN lost popularity over time, while the focus shifted to CNNs, RNNs, LSTMs and AEs.
Evolution of the popularity of type of methods regarding anomaly detection over the years. ML stands for Machine Learning, DL for Deep Learning, SL for Statistical Learning, and RL for Reinforcement Learning
Evolution of the popularity of methods regarding anomaly detection over the years. ESD stands for Extreme Studentized Deviate, PCA for Principal Component Analysis, rPCA for Robust Principal Component Analysis, MCD for Minimum Covariance Determinant, KNN for k-Nearest Neighbors, NB for Naive Bayes, SVM for Support Vector Machine, DT for Decision Trees (and includes random forest), ANN for Artificial Neural Network, FNN for Feedforward Neural Network, LSTM for Long Short-Term Memory, RNN for Recurrent Neural Network, CNN for Convolution Neural Network, SOM for Self-Organizing-Maps, RBM for Restricted Boltzmann Machines, AE for Autoencoder and DBSCAN for Density-Based Spatial Clustering of Applications with Noise
As can be concluded from the above, there are several methods that can be applied to anomaly detection. Regardless of the chosen method, we must take into consideration some problems associated with the nature of the data. The first class of problems the methods can be vulnerable to is data poisoning attacks: abnormal data injected into the training phase can end up being learned as normal. In [ 77 ], the authors deal with this problem by separating the training phase from the learning process.
Different methods should be considered when dealing with anomalies in data streams, since no single method is able to detect all types of anomalies. Furthermore, data streams are very susceptible to data poisoning attacks, since supervised methods do not know the most recent data and need to be regularly updated. Moreover, we should once more evaluate the trade-off between execution time and other performance metrics. Finally, in the context of big data and ML, we should take into account that we are dealing with a class imbalance problem.
Data by itself can have no value for organizations and society. However, through analysis we can transform data into knowledge and improve decision-making. Nevertheless, dealing with big data can be a complex problem, especially when the data keeps growing over time. In this context, Stream Processing Engines emerged; they are an essential tool for processing big data in real-time. In this work, we presented and compared several frameworks for processing data streams in real-time. Spark is not a native streaming framework, since it uses micro-batches, which brings some performance issues; however, it is the most popular framework and offers several exploratory data analysis and machine learning modules. On the other hand, Flink deals better with data-intensive applications, while Heron seems to scale better.
We also presented approaches to deal with common big data problems, such as forecasting and anomaly detection in real-time. Applying these algorithms in real time can be very beneficial for organizations. For instance, forecasting can help organizations optimize the use of services and resources, while anomaly detection algorithms can prevent or minimize problems, such as network attacks, before they happen. Finally, we discussed statistical, machine learning, and deep learning approaches. Statistical methods are more explainable and computationally lighter; machine learning methods deal better with complex data and can forecast over longer horizons.
As future research directions, we suggest real-time analytics and algorithms over big data time series streams, namely having time-series-related machine learning and deep learning algorithms take advantage of online learning to provide real-time analysis, forecasts, and anomaly detection. Another possible research direction is the development of explainable methods focused on time series.
https://hadoop.apache.org/ .
https://spark.apache.org/ .
https://flink.apache.org/ .
https://storm.apache.org/ .
https://heron.apache.org/ .
https://samza.apache.org/ .
https://aws.amazon.com/kinesis/ .
https://incubator.apache.org/ .
www.scopus.com.
Cox M, Ellsworth D. Application-controlled demand paging for out-of-core visualization. In: Proceedings of the 8th Conference on Visualization ’97. VIS ’97, pp. 235–244. IEEE Computer Society Press, Washington, DC, USA, 1997. https://doi.org/10.1109/VISUAL.1997.663888
Fan J, Han F, Liu H. Challenges of Big Data analysis. Natl Sci Rev. 2014;1(2):293–314. https://doi.org/10.1093/nsr/nwt032 .
Gomes EHA, Plentz PDM, Rolt CRD, Dantas MAR. A survey on data stream, big data and real-time. Int J Netw Virtual Organ. 2019;20(2):143–67. https://doi.org/10.1504/IJNVO.2019.097631 .
Zhou B, Li J, Wang X, Gu Y, Xu L, Hu Y, Zhu L. Online internet traffic monitoring system using spark streaming. Big Data Mining Anal. 2018;1(1):47–56. https://doi.org/10.26599/BDMA.2018.9020005 .
Thudumu S, Branch P, Jin J, Singh J. A comprehensive survey of anomaly detection techniques for high dimensional big data. J Big Data. 2020. https://doi.org/10.1186/s40537-020-00320-x .
Es-Samaali H, Outchakoucht A, Benhadou S, Mounnan O, Abou El Kalam A. Anomaly detection for big data security: a benchmark. In: 2021 the 3rd International Conference on Big Data Engineering and Technology (BDET). BDET 2021, Association for Computing Machinery, New York, NY, USA 2021, pp. 35–39. https://doi.org/10.1145/3474944.3474950
Liu X, Buyya R. Resource management and scheduling in distributed stream processing systems: a taxonomy, review, and future directions. ACM Comput Surv. 2020. https://doi.org/10.1145/3355399 .
Sahal R, Breslin JG, Ali MI. Big data and stream processing platforms for industry 4.0 requirements mapping for a predictive maintenance use case. J Manuf Syst. 2020;54:138–51. https://doi.org/10.1016/j.jmsy.2019.11.004 .
Kolajo T, Daramola O, Adebiyi A. Big data stream analysis: a systematic literature review. J Big Data. 2019;6(1):47. https://doi.org/10.1186/s40537-019-0210-7 .
Namiot D. On big data stream processing. Int J Open Info Technol. 2015;3(8):48–51.
Wu Y. Network big data: a literature survey on stream data mining. J Softw. 2014. https://doi.org/10.4304/jsw.9.9.2427-2434 .
Inoubli W, Aridhi S, Mezni H, Maddouri M, Mephu Nguifo E. A comparative study on streaming frameworks for big data. In: Ziviani A, Hara CS, Ogasawara ES, de Macêdo JAF, Valduriez P, editors. LADaS@VLDB. Rio de Janeiro: CEUR-WS.org; 2018. p. 17–24.
Dai Q, Qian J. A distributed stream data processing platform design and implementation in smart cities. In: 2020 IEEE 3rd International Conference on Electronic Information and Communication Technology (ICEICT), 2020, pp. 688–693. https://doi.org/10.1109/ICEICT51264.2020.9334234
Ahmed M, Choudhury N, Uddin S. Anomaly detection on big data in financial markets. In: 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017, pp. 998–1001
L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM. Machine learning with big data: challenges and approaches. IEEE Access. 2017;5:7776–97. https://doi.org/10.1109/ACCESS.2017.2696365 .
Johnson J, Khoshgoftaar T. Survey on deep learning with class imbalance. J Big Data. 2019;6:27. https://doi.org/10.1186/s40537-019-0192-5 .
Luo Y, Du X, Sun Y. Survey on real-time anomaly detection technology for big data streams. In: 2018 12th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), 2018, pp. 26–30. https://doi.org/10.1109/ICASID.2018.8693216
Zhu Y, Zhong XY. Data explosion, data nature and dataology. Brain Inform. 2009;5819:147–58. https://doi.org/10.1007/978-3-642-04954-5_25 .
Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manag. 2015;35(2):137–44. https://doi.org/10.1016/j.ijinfomgt.2014.10.007 .
Trifunovic N, Milutinovic V, Salom J, Kos A. Paradigm shift in big data supercomputing: dataflow vs. controlflow. J Big Data. 2015. https://doi.org/10.1186/s40537-014-0010-z .
Arya M, Sastry GH. Deal-’deep ensemble algorithm’ framework for credit card fraud detection in real-time data stream with google tensorflow. Smart Sci. 2020;8(2):71–83. https://doi.org/10.1080/23080477.2020.1783491 .
Zhao S, Chandrashekar M, Lee Y, Medhi D. Real-time network anomaly detection system using machine learning. In: 2015 11th International Conference on the Design of Reliable Communication Networks (DRCN), 2015, pp. 267–270. https://doi.org/10.1109/DRCN.2015.7149025
Hennig L, Thomas P, Ai R, Kirschnick J, Wang H, Pannier J, Zimmermann N, Schmeier S, Xu F, Ostwald J, Uszkoreit H. Real-time discovery and geospatial visualization of mobility and industry events from large-scale, heterogeneous data streams. In: Proceedings of ACL-2016 System Demonstrations. Association for Computational Linguistics, Berlin, Germany 2016, pp. 37–42. https://doi.org/10.18653/v1/P16-4007. https://aclanthology.org/P16-4007
Baban P. Pre-processing and data validation in IOT data streams. In: Proceedings of the 14th ACM International Conference on Distributed and Event-Based Systems. DEBS ’20. Association for Computing Machinery, New York, NY, USA 2020, pp. 226–229. https://doi.org/10.1145/3401025.3406443
Kovacs A, Bogdandy B, Toth Z. Predict stock market prices with recurrent neural networks using NASDAQ data stream, 2021, pp. 449–454. https://doi.org/10.1109/SACI51354.2021.9465634
Bahri M, Bifet A, Gama J, Gomes HM, Maniu S. Data stream analysis: foundations, major tasks and tools. WIREs Data Min Knowl Discov. 2021;11(3):1405. https://doi.org/10.1002/widm.1405 .
Namiot D, Sneps-Sneppe M, Pauliks R. On data stream processing in IOT applications. In: Galinina O, Andreev S, Balandin S, Koucheryavy Y, editors. Internet of things, smart spaces, and next generation networks and systems. Cham: Springer; 2018. p. 41–51.
Choudhary P, Garg K. Comparative analysis of spark and hadoop through imputation of data on big datasets. In: 2021 IEEE Bombay Section Signature Conference (IBSSC), 2021, pp. 1–6. https://doi.org/10.1109/IBSSC53889.2021.9673461
Karakaya Z, Yazici A, Alayyoub M. A comparison of stream processing frameworks. In: 2017 International Conference on Computer and Applications (ICCA), 2017, pp. 1–12 . https://doi.org/10.1109/COMAPP.2017.8079733
Nasiri H, Nasehi S, Goudarzi M. Evaluation of distributed stream processing frameworks for IOT applications in smart cities. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0215-2 .
Shahverdi E, Awad A, Sakr S. Big stream processing systems: an experimental evaluation. In: 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), 2019, pp. 53–60. https://doi.org/10.1109/ICDEW.2019.00-35
Wecel K, Szmydt M, Stróżyna M. Stream processing tools for analyzing objects in motion sending high-volume location data. Bus Inf Syst. 2021;1:257–68. https://doi.org/10.52825/bis.v1i.41 .
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10. USENIX Association, USA 2010, p. 10
Kulkarni S, Bhagat N, Fu M, Kedigehalli V, Kellogg C, Mittal S, Patel JM, Ramasamy K, Taneja S. Twitter heron: stream processing at scale. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15. Association for Computing Machinery, New York, NY, USA 2015, pp. 239–250. https://doi.org/10.1145/2723372.2742788
Salloum S, Dautov R, Chen X, Peng PX, Huang JZ. Big data analytics on apache spark. Int J Data Sci Anal. 2016;1:145–64. https://doi.org/10.1007/s41060-016-0027-9 .
Ding N, Gao H, Bu H, Ma H. Radm:real-time anomaly detection in multivariate time series based on bayesian network. In: 2018 IEEE International Conference on Smart Internet of Things (SmartIoT), 2018, pp. 129–134. https://doi.org/10.1109/SmartIoT.2018.00-13
Qin X, Tang S, Chen X, Miao D, Wei G. Sqoe kqis anomaly detection in cellular networks: fast online detection framework with hourglass clustering. China Commun. 2018;15(10):25–37. https://doi.org/10.1109/CC.2018.8485466 .
Almeida A, Brás S, Oliveira I, Sargento S. Vehicular traffic flow prediction using deployed traffic counters in a city. Futur Gener Comput Syst. 2022;128:429–42. https://doi.org/10.1016/j.future.2021.10.022 .
Makridakis S, Spiliotis E, Assimakopoulos V. The m4 competition: results, findings, conclusion and way forward. Int J Forecast. 2018;34(4):802–8. https://doi.org/10.1016/j.ijforecast.2018.06.001 .
Makridakis S, Spiliotis E, Assimakopoulos V. M5 accuracy competition: results, findings, and conclusions. Int J Forecast. 2022. https://doi.org/10.1016/j.ijforecast.2021.11.013 .
Karlaftis MG, Vlahogianni EI. Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transp Res Part C Emerg Technol. 2011;19(3):387–99. https://doi.org/10.1016/j.trc.2010.10.004 .
Carvalho DV, Pereira EM, Cardoso JS. Machine learning interpretability: a survey on methods and metrics. Electronics (Switzerland). 2019. https://doi.org/10.3390/electronics8080832 .
Sezer OB, Gudelek MU, Ozbayoglu AM. Financial time series forecasting with deep learning: a systematic literature review: 2005–2019. Appl Soft Comput. 2020;90: 106181. https://doi.org/10.1016/j.asoc.2020.106181 .
Masini RP, Medeiros MC, Mendes EF. Machine learning advances for time series forecasting. J Econ Surv. 2023;37(1):76–111. https://doi.org/10.1111/joes.12429 .
Khan W, Ghazanfar MA, Azam MA, Karami A, Alyoubi K, Alfakeeh A. Stock market prediction using machine learning classifiers and social media news. J Ambient Intell Humaniz Comput. 2022. https://doi.org/10.1007/s12652-020-01839-w .
Ali A, Zhu Y, Zakarya M. Exploiting dynamic spatio-temporal graph convolutional neural networks for citywide traffic flows prediction. Neural Netw. 2022;145:233–47. https://doi.org/10.1016/j.neunet.2021.10.021 .
Guo K, Hu Y, Qian Z, Liu H, Zhang K, Sun Y, Gao J, Yin B. Optimized graph convolution recurrent neural network for traffic prediction. IEEE Trans Intell Transp Syst. 2021;22(2):1138–49. https://doi.org/10.1109/TITS.2019.2963722 .
Zhu B, Ye S, Wang P, Chevallier J, Wei Y-M. Forecasting carbon price using a multi-objective least squares support vector machine with mixture kernels. J Forecast. 2022;41(1):100–17.
Olu-Ajayi R, Alaka H, Sulaimon I, Sunmola F, Ajayi S. Building energy consumption prediction for residential buildings using deep learning and other machine learning techniques. J Build Eng. 2022;45: 103406. https://doi.org/10.1016/j.jobe.2021.103406 .
Zhang S, Zhao S, Yuan M, Zeng J, Yao J, Lyu MR, King I. Traffic prediction based power saving in cellular networks: a machine learning method. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’17. Association for Computing Machinery, New York, NY, USA 2017) https://doi.org/10.1145/3139958.3140053
Hoang MX, Zheng Y, Singh AK. FCCF: Forecasting citywide crowd flows based on big data. In: Proceeding of the 24rd ACM International Conference on Advances in Geographical Information Systems (ACM SIGSPATIAL 2016). ACM SIGSPATIAL 2016, 2016. https://www.microsoft.com/en-us/research/publication/forecasting-citywide-crowd-flows-based-big-data/
Yao H, Wu F, Ke J, Tang X, Jia Y, Lu S, Gong P, Ye J, Li Z. Deep multi-view spatial-temporal network for taxi demand prediction. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 2588–2595. AAAI Press, 2018. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16069
Zhang K, Liu Z, Zheng L. Short-term prediction of passenger demand in multi-zone level: temporal convolutional neural network with multi-task learning. IEEE Trans Intell Transp Syst. 2020;21(4):1480–90. https://doi.org/10.1109/TITS.2019.2909571 .
Junior G, Rodrigues J, Carvalho L, Al-Muhtadi J, Proença M. A comprehensive survey on network anomaly detection. Telecommun Syst. 2019. https://doi.org/10.1007/s11235-018-0475-8 .
Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv 2009. https://doi.org/10.1145/1541880.1541882
Ahmed M, Naser Mahmood A, Hu J. A survey of network anomaly detection techniques. J Netw Comput Appl. 2016;60:19–31. https://doi.org/10.1016/j.jnca.2015.11.016 .
Zhu M, Ye K, Xu C-Z. Network anomaly detection and identification based on deep learning methods. In: Luo M, Zhang L-J, editors. Cloud computing–CLOUD 2018. Cham: Springer; 2018. p. 219–34.
Hochenbaum J, Vallis OS, Kejariwal A. Automatic anomaly detection in the cloud via statistical learning. CoRR abs/1704.07706, 2017.
Barford P, Kline J, Plonka D, Ron A. A signal analysis of network traffic anomalies. In: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement. IMW '02. Association for Computing Machinery, New York, NY, USA, 2002, pp. 71–82. https://doi.org/10.1145/637201.637210
Jiang D, Yao C, Xu Z, Qin W. Multi-scale anomaly detection for high-speed network traffic. Trans Emerg Telecommun Technol. 2015;26(3):308–17. https://doi.org/10.1002/ett.2619 .
Matsuda T, Morita T, Kudo T, Takine T. Traffic anomaly detection based on robust principal component analysis using periodic traffic behavior. IEICE Trans Commun E100.B(5), 2017, pp. 749–761 . https://doi.org/10.1587/transcom.2016EBP3239 .
Jiang R, Fei H, Huan J. A family of joint sparse PCA algorithms for anomaly localization in network data streams. IEEE Trans Knowl Data Eng. 2013;25(11):2421–33. https://doi.org/10.1109/TKDE.2012.176 .
Li Y, Lu T, Guo L, Tian Z, Qi L. Optimizing network anomaly detection scheme using instance selection mechanism. In: GLOBECOM 2009–2009 IEEE Global Telecommunications Conference, 2009, pp. 1–7. https://doi.org/10.1109/GLOCOM.2009.5425547
Hussain B, Du Q, Zhang S, Imran A, Imran MA. Mobile edge computing-based data-driven deep learning framework for anomaly detection. IEEE Access. 2019;7:137656–67. https://doi.org/10.1109/ACCESS.2019.2942485 .
Radford BJ, Apolonio LM, Trias AJ, Simpson JA. Network traffic anomaly detection using recurrent neural networks. CoRR 2018.
Tao X, Peng Y, Zhao F, Yang C, Qiang B, Wang Y, Xiong Z. Gated recurrent unit-based parallel network traffic anomaly detection using subagging ensembles. Ad Hoc Netw. 2021. https://doi.org/10.1016/j.adhoc.2021.102465 .
Ravi V, Kp S, Poornachandran P. Evaluation of recurrent neural network and its variants for intrusion detection system (IDs). Int J Inf Syst Model Des. 2017;8:43–63. https://doi.org/10.4018/IJISMD.2017070103 .
Nie L, Li Y, Kong X. Spatio-temporal network traffic estimation and anomaly detection based on convolutional neural network in vehicular ad-hoc networks. IEEE Access. 2018;6:40168–76. https://doi.org/10.1109/ACCESS.2018.2854842 .
Ogawa Y, Kimura T, Cheng J. Vulnerability assessment for machine learning based network anomaly detection system. In: 2020 IEEE International Conference on Consumer Electronics–Taiwan (ICCE-Taiwan), 2020, pp. 1–2. https://doi.org/10.1109/ICCE-Taiwan49838.2020.9258068
Garg S, Kaur K, Kumar N, Rodrigues JJPC. Hybrid deep-learning-based anomaly detection scheme for suspicious flow detection in SDN: a social multimedia perspective. IEEE Trans Multimedia. 2019;21(3):566–78. https://doi.org/10.1109/TMM.2019.2893549 .
Sarasamma ST, Zhu QA, Huff J. Hierarchical kohonenen net for anomaly detection in network security. IEEE Trans Syst Man Cybern Syst. 2005;35(2):302–12. https://doi.org/10.1109/TSMCB.2005.843274 .
Chen Z, Yeo C, Lee B-S, Lau C. Autoencoder-based network anomaly detection. 2018 Wireless Telecommunications Symposium (WTS), 2018, p. 1–5. https://doi.org/10.1109/WTS.2018.8363930 .
Dai S, Yan J, Wang X, Zhang L. A deep one-class model for network anomaly detection. IOP Conf Ser Mater Sci Eng. 2019;563: 042007. https://doi.org/10.1088/1757-899X/563/4/042007 .
Kwon D, Natarajan K, Suh S, Kim H, Kim J. An empirical study on network anomaly detection using convolutional neural networks. In: 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), 2018, pp. 1595–1598. https://doi.org/10.1109/ICDCS.2018.00178
Kasai H, Kellerer W, Kleinsteuber M. Network volume anomaly detection and identification in large-scale networks based on online time-structured traffic tensor tracking. IEEE Trans Netw Serv Manag. 2016;13(3):636–50. https://doi.org/10.1109/TNSM.2016.2598788 .
Xie K, Li X, Wang X, Xie G, Wen J, Cao J, Zhang D. Fast tensor factorization for accurate internet anomaly detection. IEEE/ACM Trans Netw. 2017;25(6):3794–807. https://doi.org/10.1109/TNET.2017.2761704 .
Moustafa N, Choo K-KR, Radwan I, Camtepe S. Outlier dirichlet mixture mechanism: adversarial statistical learning for anomaly detection in the fog. IEEE Trans Inf Forensics Secur. 2019;14(8):1975–87. https://doi.org/10.1109/TIFS.2018.2890808 .
Zhou J, Gandomi AH, Chen F, Holzinger A. Evaluating the quality of machine learning explanations: a survey on methods and metrics. Electronics. 2021. https://doi.org/10.3390/electronics10050593 .
Buhl H, Roeglinger M, Moser F, Heidemann J. Big data: a fashionable topic with(out) sustainable relevance for research and practice? Bus Inf Syst Eng. 2013;5:65–9. https://doi.org/10.1007/s12599-013-0249-5 .
Vigen T. Spurious correlations. 2022. https://www.tylervigen.com/spurious-correlations . Accessed 7 Sep 2022.
Google: google trends. 2022. https://trends.google.com/trends/explore . Accessed 07 Sept 2022.
Kobayashi L, Oyalowo A, Agrawal U, Chen S-L, Asaad W, Hu X, Loparo KA, Jay GD, Merck DL. Development and deployment of an open, modular, near-real-time patient monitor datastream conduit toolkit to enable healthcare multimodal data fusion in a live emergency department setting for experimental bedside clinical informatics research. IEEE Sensors Lett. 2019;3(1):1–4. https://doi.org/10.1109/LSENS.2018.2880140 .
Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533–4. https://doi.org/10.1016/s1473-3099(20)30120-1 .
Schultz W, Javey S, Sorokina A. Smart water meters and data analytics decrease wasted water due to leaks. J Am Water Works Assoc. 2018;110(11):24–30. https://doi.org/10.1002/awwa.1124 .
Feuerriegel S, Dolata M, Schwabe G. Fair AI: challenges and opportunities. Bus Inf Syst Eng. 2020. https://doi.org/10.1007/s12599-020-00650-3 .
Confluent: what is streaming data? How it works, examples, and use cases. 2022. https://www.confluent.io/learn/data-streaming/ . Accessed 30 Aug 2022.
Flink A. Stateful computations over data streams. 2022. https://flink.apache.org/ . Accessed 28 Jun 2022.
Flink A. Windows: Apache Flink. 2022. https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/ h. Accessed 28 Jul 2022.
Lam C. Hadoop in action. 1st ed. USA: Manning Publications Co.; 2010.
of the ACM C: Apache spark: a unified engine for big data processing on VIMEO. 2022. https://vimeo.com/185645796 . Accessed 21 Jul 2022.
Hueske F. What is/are the main difference(s) between Flink and Storm? Stack Overflow. https://stackoverflow.com/a/30719138 . Accessed 28 Jun 2022.
Zhang Y. Building a better and faster Beam Samza runner: LinkedIn engineering. https://engineering.linkedin.com/blog/2020/building-a-better-and-faster-beam-samza-runner . Accessed 30 Jun 2022.
Foundation TAS. Apache Heron. A realtime, distributed, fault-tolerant stream processing engine. 2022. https://heron.apache.org/ . Accessed 30 Aug 2022.
Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. 3rd ed. Melbourne: OTexts; 2021.
Pal A, Prakash P. Practical time series analysis: master time series data processing, visualization, and modeling using python. UK: Packt Publishing; 2017.
Brownlee J. Introduction to time series forecasting with python: how to prepare data and develop models to predict the future. Machine Learning Mastery, San Juan, Puerto Rico, 2017. https://books.google.pt/books?id=-AiqDwAAQBAJ
Muñoz P, Barco R, Serrano I, Gómez-Andrades A. Correlation-based time-series analysis for cell degradation detection in son. IEEE Commun Lett. 2016;20(2):396–9. https://doi.org/10.1109/LCOMM.2016.2516004 .
This work is supported by FEDER, through POR LISBOA 2020 and COMPETE 2020 of the Portugal 2020 Project CityCatalyst POCI-01-0247-FEDER-046119. Ana Almeida acknowledges the Doctoral Grant from Fundação para a Ciência e Tecnologia (2021.06222.BD). Susana Brás is funded by national funds, European Regional Development Fund, FSE, through COMPETE2020 and FCT, in the scope of the framework contract foreseen in the numbers 4, 5 and 6 of the article 23, of the Decree-Law 57/2016, of August 29, changed by Law 57/2017, of July 19.
Authors and affiliations.
Instituto de Telecomunicações, Aveiro, Portugal
Ana Almeida, Susana Sargento & Filipe Cabral Pinto
Departamento de Eletrónica, Telecomunicações e Informática, Universidade de Aveiro, Aveiro, Portugal
Ana Almeida, Susana Brás & Susana Sargento
IEETA, DETI, LASI, Universidade de Aveiro, Aveiro, Portugal
Susana Brás
Altice Labs, Aveiro, Portugal
Filipe Cabral Pinto
Conceptualization: AA; Data curation: AA; Formal analysis: AA; Investigation: AA; Methodology: AA; Software: AA; Validation: AA, SB; Visualization: AA; Writing—original draft: AA; Funding acquisition: SS; Project administration: SS; Supervision: SB, SS, FCP; Writing—review & editing: SB, SS, FCP. All authors read the final manuscript.
Correspondence to Ana Almeida .
Competing interests.
The authors declare that they have no competing interests.
Cite this article.
Almeida, A., Brás, S., Sargento, S. et al. Time series big data: a survey on data stream frameworks, analysis and algorithms. J Big Data 10, 83 (2023). https://doi.org/10.1186/s40537-023-00760-1
Received: 12 October 2022
Accepted: 08 May 2023
Published: 28 May 2023
DOI: https://doi.org/10.1186/s40537-023-00760-1
Advancements in Deep Learning Techniques for Time Series Forecasting in Maritime Applications: A Comprehensive Review

Section outline: 2. Literature collection procedure; 3.1 Artificial neural network (ANN): 3.1.1 Multilayer perceptron (MLP)/deep neural networks (DNN), 3.1.2 WaveNet, 3.1.3 Randomized neural network; 3.2 Convolutional neural network (CNN); 3.3 Recurrent neural network (RNN): 3.3.1 Long short-term memory (LSTM), 3.3.2 Gated recurrent unit (GRU); 3.4 Attention mechanism (AM)/Transformer; 3.5 Overview of algorithms usage; 4. Time series forecasting in maritime applications: 4.1 Ship operation-related applications (4.1.1 Ship trajectory prediction, 4.1.2 Meteorological factor prediction, 4.1.3 Ship fuel consumption prediction, 4.1.4 Others), 4.2 Port operation-related applications, 4.3 Shipping market-related applications, 4.4 Overview of time series forecasting in maritime applications; 5. Overall analysis: 5.1 Literature description (5.1.1 Literature distribution, 5.1.2 Literature classification), 5.2 Data utilized in maritime research (5.2.1 Automatic identification system (AIS) data, 5.2.2 High-frequency radar data and sensor data, 5.2.3 Container throughput data, 5.2.4 Other datasets), 5.3 Evaluation parameters, 5.4 Real-world application examples, 5.5 Future research directions (5.5.1 Data processing and feature extraction, 5.5.2 Model optimization and application of new technologies, 5.5.3 Specific application scenarios, 5.5.4 Practical applications and long-term predictions, 5.5.5 Environmental impact, fault prediction, and cross-domain applications); 6. Conclusions.
Ref. | Architecture | Dataset | Advantage |
---|---|---|---|
[ ] | MSCNN-GRU-AM | HF radar | Applicable to high-frequency radar ship track prediction in environments with significant clutter and interference |
[ ] | CNN-BiLSTM-Attention | 6L34DF dual-fuel diesel engine | High prediction accuracy and early-warning timeliness; provides interpretable fault prediction results |
[ ] | LSTM | Two LNG carriers | Enables early anomaly detection in new ships and new equipment |
[ ] | LSTM | Sensors | Better, high-precision results |
[ ] | Self-Attention-BiLSTM | A real military ship | Better captures complex ship attitude changes and shows greater accuracy and stability in long-term forecasting tasks |
[ ] | CNN-GRU-AM | A C11 containership | Better forecasting accuracy |
[ ] | GRU | A scaled model test | Good prediction accuracy |
[ ] | CNN | A bulk carrier | Good prediction accuracy |
Wang M, Guo X, She Y, Zhou Y, Liang M, Chen ZS. Advancements in deep learning techniques for time series forecasting in maritime applications: a comprehensive review. Information. 2024;15(8):507. https://doi.org/10.3390/info15080507
Historical and practical aspects of macular buckle surgery in the treatment of myopic tractional maculopathy: case series and literature review

International Journal of Retina and Vitreous, volume 10, Article number: 60 (2024)
Background: Uncorrected myopia is a leading cause of blindness globally, with a rising prevalence in recent decades. Pathological myopia, often seen in individuals with increased axial length (AXL), can result in severe structural changes in the posterior pole, including myopic tractional maculopathy (MTM). MTM arises from tractional forces at the vitreoretinal interface, leading to progressive macular retinoschisis, macular holes, and retinal detachment (RD). This study aims to outline preoperative evaluation and surgical indication criteria for MTM, based on the MTM staging system, and to share our Brazilian experience with three cases of macular buckle (MB) surgery, all with over a year of follow-up.
Methods: We conducted a retrospective analysis of three cases of MTM-associated RD treated with MB surgery, with or without pars plana vitrectomy. Preoperative evaluations included optical coherence tomography (OCT) and ultrasonography (USG) to assess the extent of macular involvement and retinal detachment. Surgical indications were determined based on the MTM staging system. The MB was assembled using customizable and accessible materials. Surgical procedures varied according to the specific needs of each case. An informed consent form regarding the surgical procedure was appropriately obtained for each case. The study was conducted with the proper approval of the institution's ethics committee.
Results: All three cases demonstrated successful retinal attachment during the mean follow-up of eighteen months. In the first case, combined phacoemulsification, vitrectomy, and MB were performed for MTM with macular hole and RD. The second case required MB and vitrectomy after two failed RD surgeries. In the third case, a macular detachment with an internal lamellar hole was treated with MB alone. These cases highlight the efficacy of MB surgery in managing MTM in highly myopic eyes.
Conclusions: MB surgery is an effective treatment option for MTM-associated RD in highly myopic eyes, providing long-term retinal attachment. Our experience demonstrates that with proper preoperative evaluation and surgical planning, MB can be successfully implemented using accessible materials, offering a viable solution in resource-limited settings. Further studies with larger sample sizes are warranted to validate these findings and refine surgical techniques.
Uncorrected myopia is considered one of the leading causes of blindness worldwide [ 1 ], and its prevalence has grown significantly in recent decades [ 2 ]. Specifically, in myopic individuals with increased axial length (AXL), structural changes may occur in the posterior pole that characterize pathological myopia, including posterior staphyloma, myopic macular degeneration, myopia-associated optic neuropathy, and myopic tractional maculopathy (MTM) [ 3 , 4 ]. The incidence of pathological myopia increases with age, but the condition can also occur in younger patients [ 5 ]. The impact of myopic maculopathy lies in its frequent occurrence in both eyes, its irreversibility, and its potential to affect individuals of working age [ 6 ].
MTM is a specific condition of pathological myopia secondary to tangential and anteroposterior tractional alterations at the vitreoretinal interface, in which the retina is unable to adapt to the progressive increase in AXL and undergoes structural changes. Characteristically, it involves a progressive combination of macular retinoschisis, lamellar or full-thickness macular holes, and, ultimately, retinal detachment (RD) [ 1 ]. Hence, while antiangiogenic therapy is used to treat neovascular membranes and there is no treatment for atrophic changes, MTM and its complications require precise surgical intervention; macular buckle (MB) surgery, with or without vitrectomy, is one of the available surgical techniques.
In this study, we present the historical aspects of MB and discuss preoperative evaluation and the criteria for surgical indication. We also describe our experience with MB surgery cases, including the assembly of a customizable MB using accessible materials.
The surgical treatment of RD has undergone revolutionary advancements following the theory developed by Jules Gonin in 1921, which involved surgically blocking tears and breaks in the retina [ 2 ]. However, it was soon understood that cases of surgical failure were related to the traction exerted by the vitreous on areas of retinal discontinuity, perpetuating the infiltration of subretinal fluid [ 3 , 4 ]. In an attempt to alleviate this traction by approximating the underlying choroid to the detached retina, several authors proposed techniques such as subchoroidal injection of plasma, transient indentation with gauze, or even a piece of plastic sutured to the sclera near the treated area [ 5 , 6 ]. In 1957, Schepens conceived the technique now known as scleral buckling, revolutionizing retinal surgery, and also proposed adaptations for cases of retinal detachment associated with macular holes, positioning the buckle beneath the macular region [ 6 ].
Over time, other MB techniques were developed by different authors [ 7 , 8 , 9 , 10 , 11 , 12 ]. In 1980, Ando [ 13 ] created the first solid silicone MB, facilitating implantation without the need for muscle disinsertion or suturing of the implant to the thinned posterior sclera. However, it had limitations, such as difficulty in adjusting the indentation force and interference with imaging examinations due to the embedded metal [ 14 ]. In 2012, Stirpe et al. developed a new MB that did not contain metal wires and had adjustable sutures [ 15 ], while Mateo et al. proposed coupling an illuminated probe to facilitate the precise positioning of Ando's MB beneath the macula [ 16 ].
Unfortunately, Ando's device has limitations regarding shape, tension adjustment, and posterior suturing, which hinder its reproducibility. Hence, certain authors explored alternative methods to tailor their implants, such as silicone sponges internally reinforced with stainless steel [ 17 ] or a titanium stent [ 18 , 19 ], as described by Parolini et al. (2013). In their report, Parolini et al. detailed three cases in which they used MB alone for macular detachment unrelated to macular holes. Additionally, they introduced a novel L-shaped MB design without posterior sutures, enhancing its feasibility for surgical implementation [ 18 ].
In Brazil, there are no commercially available MBs, so we chose to manufacture one following the descriptions provided by Parolini et al. [ 18 ], as we will describe throughout this article.
Macular buckle surgery requires a comprehensive preoperative ophthalmological assessment and complementary imaging exams to assist in the classification of MTM and surgical planning. Here, we highlight and discuss ocular ultrasonography (USG) and optical coherence tomography (OCT).
The importance of USG in the surgical planning of MB procedures lies in its ability to assess vitreous and retinal conditions, such as the presence of anteroposterior vitreomacular traction (VMT) and/or tears, and to locate and estimate the extent of RD. OCT can also be useful for identifying VMT, but standard OCT does not have sufficient width and depth to capture the entire retinal detachment. Sometimes, in eyes with very high myopia, it is challenging to acquire images of the macular holes; in these cases, performing the examination with a contact lens can provide better image acquisition. As wide-field OCT is not available in Brazil, USG is very useful in these situations.
USG also aids in selecting the appropriate surgical technique and determining the indication for MB [ 18 , 19 ]. Moreover, it facilitates the measurement of AXL in cases where optical biometry is unreliable, allows for the accurate calculation of intraocular lens power using the immersion technique to avoid corneal compression [ 21 ], assists in identifying structures in cases of media opacity, and ensures accurate intraoperative positioning and postoperative follow-up of the MB. Regarding the anesthetic procedure, USG is essential in evaluating the size of the staphyloma, helping to select the most suitable anesthetic method for highly myopic eyes (retrobulbar block or subtenon anesthesia) to avoid complications such as ocular perforation or intraocular injection of anesthetic in significantly large eyes [ 22 , 23 , 24 ].
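To illustrate why an accurate AXL measurement matters for the intraocular lens power calculation mentioned above, the sketch below applies the classic SRK regression formula as an assumed example. The manuscript does not state which IOL power formula was used in these cases, and contemporary formulas are more sophisticated, but the sensitivity of the calculated power to axial length is the point being illustrated.

```latex
% Illustrative example only: the classic SRK regression formula, assumed here for
% demonstration; it is not stated which IOL formula was actually used in these cases.
% P: IOL power for emmetropia (D); A: lens-specific A-constant;
% L: axial length (mm); K: mean keratometry (D).
\[
  P = A - 2.5\,L - 0.9\,K
\]
% With assumed values for a highly myopic eye (A = 118.4, L = 30 mm, K = 43 D):
% P = 118.4 - 2.5(30) - 0.9(43) = 118.4 - 75.0 - 38.7 = 4.7 D.
% A 1 mm error in L shifts the calculated power by about 2.5 D, which is why
% immersion USG biometry (avoiding corneal compression) is emphasized in the text.
```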
Optical coherence tomography
The diagnosis and monitoring of MTM can be challenging due to the atrophic changes associated with pathological myopia. In this context, OCT has emerged as a fundamental diagnostic method for the non-invasive and detailed evaluation of the vitreoretinal interface, retinal layers, the retinal pigment epithelium, and the choroid, allowing for a better understanding and classification of these structures, as described below [ 25 , 26 , 27 , 28 ].
The evaluation of OCT and the correct interpretation of findings are essential steps in the surgical indication in MTM. In 2021, Parolini et al. [ 27 , 28 , 29 , 30 ] introduced a new OCT classification for MTM with strong reproducibility between examiners, intended to streamline information sharing and improve understanding of disease progression [ 29 ]. The MTM staging system (MSS) categorizes findings into two types of evolution: perpendicular and tangential. Perpendicular evolution describes the anatomical sequence of predominantly internal or inner retinoschisis (stage 1), predominantly external retinoschisis (stage 2), retinoschisis with macular detachment (stage 3), and complete macular detachment without schisis (stage 4). Tangential evolution, in turn, describes the anatomical sequence of preserved foveal contour (a), internal lamellar macular hole (b), and full-thickness macular hole (c). This classification allows for the combination of evolution types, facilitating disease categorization. The occurrence of external lamellar macular holes is described in the classification as “O”, which can happen at any stage, while the presence of epiretinal abnormalities is indicated as “Plus” [ 28 ].
Based on the MSS, a surgical management approach for MTM was proposed. The idea is that comparing MB and pars plana vitrectomy (PPV) as competing alternatives does not make sense, as each approach has its own value in treatment. Early-stage cases warrant observation (stages 1a and 2a), while intervention is reserved for those who experience a progressive decline in visual acuity (stages 1b and 2b). When tangential forces predominate, PPV alone gives good results in stage 1a with a significant epiretinal membrane, and in stages 1b and 1c.
In cases where perpendicular evolution predominates, MB alone has proven effective in stages 2b, 3a, 3b, 4a, and 4b. If epiretinal abnormalities are judged clinically significant for visual improvement after the MB procedure, a complementary PPV remains a viable option. Finally, in cases where both perpendicular and tangential forces are present, leading to macular involvement and/or macular or retinal detachment, MB + PPV is indicated (stages 2c, 3c, and 4c). The presence of “plus” alterations may require surgical intervention to improve complaints of metamorphopsia. Table 1 summarizes OCT findings and their implications for surgical indication [ 30 ].
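As a reading aid, the sketch below encodes the MSS nomenclature and the management suggestions summarized in the two paragraphs above as a simple lookup. It is an illustrative interpretation, not software from the MSS publications; the names used (PERPENDICULAR, TANGENTIAL, MANAGEMENT, describe) are hypothetical, and stage/pattern combinations not explicitly mentioned above default to individualized assessment.

```python
# Illustrative sketch of the MTM staging system (MSS) nomenclature and the
# management suggestions summarized in the text above; an interpretation for
# readability, not a validated clinical decision tool.

# Perpendicular evolution (retinal pattern), stages 1-4
PERPENDICULAR = {
    1: "predominantly internal (inner) retinoschisis",
    2: "predominantly external retinoschisis",
    3: "retinoschisis with macular detachment",
    4: "complete macular detachment without schisis",
}

# Tangential evolution (foveal pattern), a-c
TANGENTIAL = {
    "a": "preserved foveal contour",
    "b": "internal lamellar macular hole",
    "c": "full-thickness macular hole",
}

# Management suggestions as summarized in the text; combinations not listed
# default to individualized assessment.
MANAGEMENT = {
    ("1", "a"): "observation (PPV if a significant epiretinal membrane is present)",
    ("2", "a"): "observation",
    ("1", "b"): "PPV alone if visual acuity progressively declines",
    ("1", "c"): "PPV alone",
    ("2", "b"): "MB alone if visual acuity progressively declines",
    ("3", "a"): "MB alone",
    ("3", "b"): "MB alone",
    ("4", "a"): "MB alone",
    ("4", "b"): "MB alone",
    ("2", "c"): "MB + PPV",
    ("3", "c"): "MB + PPV",
    ("4", "c"): "MB + PPV",
}


def describe(label: str) -> str:
    """Expand an MSS label such as '3b', '4bO' or '2b Plus' into plain text."""
    label = label.replace(" ", "")
    perpendicular, tangential, extras = label[0], label[1], label[2:]
    parts = [PERPENDICULAR[int(perpendicular)], TANGENTIAL[tangential]]
    if "O" in extras:
        parts.append("external lamellar macular hole (O)")
    if "Plus" in extras:
        parts.append("epiretinal abnormalities (Plus)")
    plan = MANAGEMENT.get((perpendicular, tangential), "individualized assessment")
    return f"MSS {label}: " + "; ".join(parts) + f" -> suggested approach: {plan}"


if __name__ == "__main__":
    for example in ["1a", "2b", "3c", "4bO"]:
        print(describe(example))
```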
Based on the criteria outlined by Parolini et al. [ 28 , 29 , 30 ], we sought to share our experience in this small case series, where all patients underwent MB surgery, with or without PPV, and have been followed up for over a year. Additionally, we will outline the methodology employed for the MB procedure and offer a concise analysis of the results, correlating them with the current literature.
This retrospective study analyzed three patients with MTM-associated RD treated with MB surgery, with or without PPV. Preoperative evaluations used OCT and USG to determine macular involvement and the extent of RD. Surgical indications were guided by the MTM staging system, and the MB was assembled using customizable materials. Procedures were tailored to the specific needs of each patient. All participants provided written informed consent. The study received approval from the ethics committee of the Clinical Hospital of the University of São Paulo, Ribeirão Preto, SP, Brazil, and adhered to the principles of the Declaration of Helsinki.
We describe the surgical management of three cases of highly myopic eyes with MTM, where MB surgery was performed. In cases 1 and 2, RD was associated with a macular hole (MH). In case 2, the indication for MB was due to two previous failures of vitreoretinal surgery (PPV) for the treatment of retinal detachment with a macular hole. In case 3, a macular detachment was associated with an internal lamellar hole. Table 2 summarizes the main findings of each case, and Figs. 1 , 2 and 3 illustrate them.
Fig. 1. a: Color fundus photograph (wide-field, preoperative) showing retinal detachment in the posterior pole with a macular hole in the left eye (OS); b: Postoperative color fundus photograph of the OS with attached retina and a residual gas bubble; c: Preoperative USG showing retinal detachment and posterior staphyloma; d: Intraoperative USG showing correct positioning of the buckle, flattening the posterior staphyloma; e: Preoperative OCT showing a retinal detachment with an associated macular hole; f: Postoperative OCT showing a reattached retina with macular hole closure with applied edges (grade 2 closure, Kang et al.'s classification [ 31 ])
Fig. 2. a: Ultrasound of the left eye showing retinal detachment; b: Postoperative OCT revealing an attached retina; c: Postoperative color fundus photograph of the left eye demonstrating a reattached retina
Fig. 3. a: Preoperative USG showing a large posterior staphyloma with macular detachment (arrow); b: Postoperative USG showing flattening of the posterior staphyloma due to the positioning of the buckle; c: Preoperative OCT showing an internal lamellar hole with macular detachment and nasal macular retinoschisis; vitreomacular adhesion can also be observed; d: Postoperative OCT showing flattening of the posterior staphyloma, resolution of the lamellar hole and the macular detachment, and reduction of the retinoschisis; the vitreomacular adhesion remains stable; e: Fundus retinography showing an attached retina
The following materials are used for the fabrication of the MB (Fig. 4 a): one 1.5-mm titanium microplate for osteosynthesis containing 8 holes (Traumec ® , Medical Support, Brazil); one 270 sleeve-type band (Labitician, USA); one 506G oval sponge (Labitician, USA); one 15-degree blade; pliers; and strong scissors.
We used a titanium osteosynthesis plate containing 16 holes, which was cut in half (8 holes) using strong scissors (or pliers), creating the ideal size for our implant. This plate was then inserted into a 270 sleeve-type band (sleeve), covering its entire surface, with the help of Kelly forceps to open the sleeve and facilitate plate insertion, preventing any tearing. Approximately 2.0 mm of the band should be left beyond the plate on the vertical portion to protect the extremity and prevent conjunctival erosion after fixation. The plate is then bent into an “L” shape using pliers, leaving 3 holes horizontally (short arm of the L) and 5 holes vertically (long arm of the L). Next, a tunnel is made in the middle of the linear length of the 506G sponge with a 15-degree blade, ensuring it is longer than the short arm of the titanium plate to cover it, and without letting the tunnel pierce the sponge (to avoid plate exposure). Finally, the short arm of the L-shaped plate is inserted into the 506G sponge through the tunnel, and the 506G sponge should then be cut to cover the short arm of the implant, leaving at least 1.0 mm beyond the implant length to prevent exposure beyond the sponge (Fig. 4 a-c).
The initial procedures are similar whether isolated MB surgery or combined surgery with vitrectomy is performed. The procedure begins with a temporal peritomy of the conjunctiva and Tenon's capsule at the limbus, from 11 to 4 o'clock. The lateral and superior rectus muscles are isolated with a 2.0 silk suture (Ethicon, Johnson & Johnson, Brazil) to allow control of globe position. Before positioning the implant, anterior chamber paracentesis is performed to reduce intraocular pressure (IOP) and minimize pressure changes when positioning the MB. Next, the implant is placed in the upper temporal quadrant, with the shorter arm positioned under the macula and the longer arm inserted parallel to the lateral rectus muscle (Fig. 4 d). A 25-gauge chandelier light fiber (Alcon Constellation Vision System, USA) is then positioned at 6 o'clock to enable visualization of the fundus.
Subsequently, we confirm the proper positioning of the implant under the macular region using a panoramic visualization system coupled to the microscope (Resight 500 ® , Zeiss), with delicate manipulation of the implant. Once the MB positioning is confirmed, the vertical portion of the device (long arm) is sutured to the sclera with two separate stitches of 5.0 Mersilene ® suture (Ethicon, Johnson & Johnson, Brazil). To further confirm the proper positioning of the MB, we perform intraoperative USG, covering the USG probe and cable with a sterile plastic cover; at the same time, the AXL can be measured for comparison.
Fig. 4. a: Materials used for the fabrication of the macular buckle; b: Schematic drawing of the shape to be molded for the buckle; c: MB fabricated in the operating room for the described cases; d: Postoperative aspect of the correctly positioned macular buckle, visible under the conjunctiva in the upper temporal quadrant
As reported above, in the two cases in which retinal detachment was associated with MH, we performed combined MB and PPV surgery (cases 1 and 2); after positioning the MB, PPV was carried out in the routine fashion. In case 1, besides PPV and MB, phacoemulsification was performed, and C3F8 was chosen as the vitreous substitute. In case 2, because of the history of previous PPV and retinal re-detachment with MH, silicone oil was used as the vitreous substitute in addition to the MB. The case presenting an internal lamellar hole (stage 4b) with macular detachment and nasal macular retinoschisis (patient 3) was managed with MB alone, despite a slight vitreomacular adhesion, which was not considered significant.
In the immediate postoperative period of the three cases operated at our service, the patients presented slight hyperemia and mild pain that improved with analgesics (dipyrone), and none showed increased IOP. Patient 3 presented retinal hemorrhage in the posterior pole in the immediate postoperative period, probably due to the significant reduction of the large preoperative staphyloma after MB implantation. The approach was expectant: the hemorrhage was completely absorbed, and progressive reabsorption of the subretinal fluid led to repositioning of the macula over the following months, although a stable vitreomacular adhesion could still be seen. In patient 1, an attached retina and grade 2 closure of the macular hole (according to Kang et al.'s classification) were observed during follow-up [ 31 ]. Patient 2 also evolved with an attached retina and macular hole closure, with silicone oil in place. There were no reports of diplopia or limitation of ocular motility among the operated patients.
All three patients (100%) showed visual acuity improvement after surgery, maintaining an attached retina and stable vision for more than a year of follow-up. No patient experienced complications such as conjunctival erosion, displacement or rotation of the MB, endophthalmitis, or anterior chamber reactions throughout the follow-up period.
The use of MB surgery significantly decreased in the 1980s with the advancement of vitrectomy, primarily because of technical difficulties and the lack of related scientific studies at that time [ 32 , 33 ]. Nonetheless, in highly myopic eyes with posterior staphyloma, PPV can result in surgical failures in 26.7 to 50% of cases due to the inability to alter the axial length of the eye and reduce the anteroposterior forces exerted by the staphyloma [ 34 ]. The use of MB in these circumstances can reduce the anteroposterior force, providing positive results. This evidence, combined with the relevant study by Sasoh et al., which demonstrated good results and safety of MB use in the early 2000s, encouraged the resumption of studies and the development of the MB technique [ 35 ].
In 2001, Ripandelli et al. [ 36 ] compared highly myopic patients with retinal detachment and macular holes undergoing pars plana vitrectomy (group A) or MB surgery (group B). They observed a surgical success rate of 73.3% in group A and 93.3% in group B, with group B also showing a significant improvement in vision, unlike the vitrectomy group. These results suggested anatomical and functional superiority when MB was used. Similarly, Ando et al., in 2007, reported anatomical success in the MB group in 93.3% of cases after the first surgery and 100% after the second procedure, while only 50% of the cases treated with vitrectomy achieved retinal reattachment in the first procedure, and 86% in the second approach, which was associated with MB [ 37 ].
In a literature review, Alkabes and Mateo [ 32 ] showed that after MB surgery, the retinal reattachment rate ranged from 81.8 to 100%, while the MH closure rate ranged from 40 to 93.3%. Although persistent MH was identified as a risk factor for retinal re-detachment, eyes with persistent MH that underwent MB did not experience retinal re-detachment. Furthermore, the literature indicates that patients with AXLs greater than 30 mm have a higher risk of early retinal re-detachment after PPV. Several studies have shown statistically significant higher rates of retinal re-detachment after PPV for treating RD associated with MH in patients with AXL > 30 mm [ 38 , 39 , 40 ]. For these patients, when undergoing the MB procedure, the retina was reattached in 100% of cases and the MH closure rate ranged from 40 to 100%. Notably, no re-detachment was observed in cases of persistent MH [ 32 ]. In our two cases involving RD and MH that underwent combined surgery, both achieved successful outcomes with retinal reattachment and macular hole closure, with no retinal re-detachment observed.
In general, both PPV and MB have been shown to be effective in improving retinal anatomy and visual acuity. However, PPV, particularly when combined with internal limiting membrane (ILM) peeling, is associated with a higher incidence of postoperative MH. Due to the lack of randomized studies, it is challenging to determine whether MB or PPV is superior for treating progressive macular foveoschisis. Given its progressive nature and the potential for RD with MH, surgical intervention should be considered if the schisis progresses or visual acuity decreases. Regular OCT monitoring and early intervention based on physician experience are recommended [ 32 , 41 , 42 ].
Regarding complications, patient 3 experienced retinal hemorrhage following MB surgery, which resolved spontaneously within one month. This patient had a deep staphyloma of the posterior pole, and after MB, the AXL was significantly reduced by 7.9 mm. Despite performing a paracentesis at the beginning of the procedure, no hypotony was observed. We attributed the retinal hemorrhage to the pronounced reduction in AXL. Mateo and colleagues previously described cases where excessive compression of the choroidal vessels could lead to increased local hydrostatic pressure and changes in the RPE, resulting in subretinal fluid and, in some cases, macular atrophy [ 32 , 43 ]. However, we did not observe any of these complications in patient 3 or the other patients.
Other potential complications reported in various case series include scleral perforation, orbital fat prolapse, improper positioning of the explant, and ocular muscle disinsertion during buckle placement [ 32 ]. During the mean follow-up period of eighteen months, no issues such as intraocular pressure changes, strabismus, eye movement restriction, explant displacement, choroidal effusion, choroidal detachment, or posterior pole atrophy were observed.
As demonstrated by Parolini et al., the management of MTM can range from using MB alone to performing combined surgeries. When full-thickness macular holes and macular or retinal detachment are present, a combination of PPV and MB is recommended, as each surgical method targets different force vectors affecting MTM [ 29 , 30 ].
Despite the positive outcomes demonstrated in this report and in the literature, MB can present complications. It is essential to evaluate the risk-benefit ratio carefully and reserve its use for cases where it is truly necessary, based on an appropriate classification system. Therefore, we recommend considering MB + PPV surgery as the first choice for highly myopic patients with macular RD associated with MH, given the high rates of retinal re-detachment after PPV alone. In the small case series reported herein, success was achieved with combined surgery in two cases and with MB alone in one case, proving effective in improving anatomical and functional outcomes without the need for additional interventions. None of the patients experienced re-detachment with combined surgery or MB alone, which is consistent with the literature.
Finally, it is important to emphasize that the contralateral eye of all three patients continues to be followed up with OCT and fundoscopy. Macular buckling should be considered if any anatomical or visual deterioration occurs, depending on the classification of tractional maculopathy.
MB has proven effective in our small experience, whether alone or in conjunction with PPV, in managing MTM. Its indication should consider the pathophysiological mechanism of MTM, which is influenced by tangential and anteroposterior forces, with PPV needing to be combined in many cases. Decision-making should be based on the patient's evolution regarding symptoms of decreased vision, anatomical findings on fundoscopy and ocular ultrasound, and the OCT classification. The postoperative results reported here and in the literature show good anatomical and functional outcomes and no recurrence of retinal detachment, indicating that the macular buckle can contribute to better results in eyes with very long axial lengths.
No datasets were generated or analysed during the current study.
Abbreviations
AXL: Axial length
CF: Counting fingers
IOP: Intraocular pressure
MB: Macular buckle
MH: Macular hole
MSS: MTM staging system
MTM: Myopic tractional maculopathy
PPV: Pars plana vitrectomy
RD: Retinal detachment
SO: Silicone oil
PHACO: Phacoemulsification
USG: Ultrasonography
VA: Visual acuity
Flitcroft DI, He M, Jonas JB, Jong M, Naidoo K, Ohno-Matsui K, et al. IMI – defining and classifying myopia: a proposed set of standards for clinical and epidemiologic studies. Invest Ophthalmol Vis Sci. 2019;60(3):M20–30.
Morais FB. Jules Gonin and the Nobel Prize: pioneer of retinal detachment surgery who almost received a Nobel Prize in medicine. Int J Retina Vitreous. 2018;4.
Schepens CL. Progress in detachment surgery. Trans Am Acad Ophthalmol Otolaryngol. 1951;55.
Schepens CL. Clinical aspects of pathologic changes in the vitreous body. Am J Ophthalmol. 1954;38(1 PART 2).
Custodis E. Die Behandlung der Netzhautablösung durch umschriebene Diathermiekoagulation und einer mittels Plombenaufnähung erzeugten Eindellung der Sklera im Bereich des Risses. Klin Monbl Augenheilkd Augenarztl Fortbild. 1956;129(4).
Schepens CL, Okamura ID, Brockhurst RJ. The scleral buckling procedures: surgical techniques and management. AMA Arch Ophthalmol. 1957;58(6):797–811. http://archopht.jamanetwork.com/
Rosengren B. The silver plomb method in macular holes. Trans Ophthalmol Soc U K. 1966;86:49–53.
Theodossiadis GP. [A simplified technique for the surgical treatment of retinal detachments resulting from macular holes (author's transl)]. Klin Monbl Augenheilkd. 1973;162(6):719–28.
Siam A. Macular hole with central retinal detachment in high myopia with posterior staphyloma. Br J Ophthalmol. 1969;53(1):62–3.
Klöti R. Silver clip for central retinal detachments with macular hole. Mod Probl Ophthalmol. 1974;12(0):330–6.
Feman SS, Hepler RS, Straatsma BR. Rhegmatogenous retinal detachment due to macular hole. Management with cryotherapy and a Y-shaped sling. Arch Ophthalmol. 1974;91(5):371–2.
Landolfo V, Albini L, Romano A. Macular hole-induced retinal detachment: treatment with an armed-silicone implant. Ophthalmic Surg. 1986;17(12):810–2.
Ando F. Use of a special macular explant in surgery for retinal detachment with macular hole. Jpn J Ophthalmol. 1980;24:29–34.
Susvar P, Sood G. Current concepts of macular buckle in myopic traction maculopathy. Indian J Ophthalmol. 2018;66:1772–84.
Stirpe M, Ripandelli G, Rossi T, Cacciamani A, Orciuolo M. A new adjustable macular buckle designed for highly myopic eyes. Retina. 2012;32(7):1424–7. https://doi.org/10.1097/IAE.0b013e3182550648 .
Mateo C, Dutra Medeiros M, Alkabes M, Burés-Jelstrup A, Postorino M, et al. Illuminated Ando plombe for optimal positioning in highly myopic eyes with vitreoretinal diseases secondary to posterior staphyloma. http://archopht.jamanetwork.com/
Mortada HA. A novel episcleral macular buckling: wire-strengthened sponge exoplant for recurrent macular hole and retinal detachment in high myopic eyes. Discovery & Innovation Ophthalmology Journal; 2013;2.
Parolini B, Frisina R, Pinackatt S, Mete M. A new L-shaped design of macular buckle to support a posterior staphyloma in high myopia. Retina. 2013;33(7):1466–70. https://doi.org/10.1097/IAE.0b013e31828e69ea
Parolini B, Frisina R, Pinackatt S, Gasparotti R, Gatti E, Baldi A et al. Indications and results of a New L-shaped Macular Buckle to support a posterior staphyloma in high myopia. Retina. 2015.
Ahmed J, Shaikh F, Rizwan A, Memon MF, Ahmad J. Evaluation of Vitreo-Retinal pathologies using B-Scan Ultrasound. Pak J Ophthalmol. 2009;25(4).
Gulkilik G, Ustuner A, Ozdamar A. Comparison of optical coherence biometry and applanation ultrasound biometry in high-myopic eyes with posterior Pole staphyloma. Ann Ophthalmol. 2007;39(3).
Palte HD. Ophthalmic regional blocks: management, challenges, and solutions. Local Reg Anesth. 2015;8.
Qureshi MA, Laghari K. Role of B-scan ultrasonography in pre-operative cataract patients. Int J Health Sci (Qassim). 2010;4(1).
Shinar Z, Chan L, Orlinsky M. Use of ocular ultrasound for the evaluation of retinal detachment. J Emerg Med. 2011;40(1).
Alanazi R, Schellini S, AlSheikh O, Elkhamary S. Scleral buckle induce orbital cellulitis and scleritis – a case report and literature review. Saudi J Ophthalmol. 2019;33(4).
Huang D, Swanson EA, Lin CP, Schuman JS, Stinson WG, Chang W et al. Optical coherence tomography. Science (1979). 1991;254(5035).
Panozzo G, Mercanti A. Optical coherence tomography findings in myopic traction maculopathy. Arch Ophthalmol. 2004;122.
Parolini B, Palmieri M, Finzi A, Besozzi G, Lucente A, Nava U et al. The New Myopic Traction Maculopathy Staging System. Eur J Ophthalmol. 2021;31(3).
Parolini B, Arevalo JF, Hassan T, Kaiser P, Rezaei KA, Singh R et al. International Validation of myopic traction Maculopathy Staging System. Ophthalmic Surg Lasers Imaging Retina. 2023;54(3).
Parolini B, Palmieri M, Finzi A, Frisina R. Proposal for the management of myopic traction maculopathy based on the new MTM staging system. Eur J Ophthalmol. 2021;31(6).
Kang SW, Ahn K, Ham DI. Types of macular hole closure and their clinical implications. Br J Ophthalmol. 2003;87(8).
Alkabes M, Mateo C. Macular buckle technique in myopic traction maculopathy: a 16-year review of the literature and a comparison with vitreous surgery. Graefes Arch Clin Exp Ophthalmol. 2018;256:863–77.
Gonvers M, Machemer R. A New Approach to treating Retinal detachment with Macular Hole. Am J Ophthalmol. 1982;94(4):468–72.
Ikuno Y, Sayanagi K, Oshima T, Gomi F, Kusaka S, Kamei M, et al. Optical coherence tomographic findings of macular holes and retinal detachment after vitrectomy in highly myopic eyes. Am J Ophthalmol. 2003;136(3):477–81.
Sasoh M, Yoshida S, Ito Y, Matsui K, Osawa S, Uji Y. Macular buckling for retinal detachment due to macular hole in highly myopic eyes with posterior staphyloma. Retina. 2000;20(5):445–9.
Ripandelli G, Coppé AM, Fedeli R, Parisi V, D’Amico DJ, Stirpe M. Evaluation of primary surgical procedures for retinal detachment with macular hole in highly myopic eyes a randomized comparison of vitrectomy versus posterior episcleral buckling surgery. Ophthalmology. 2001;108(12):2258–64.
Ando F, Ohba N, Touura K, Hirose H. Anatomical and visual outcomes after episcleral macular buckling compared with those after pars plana vitrectomy for retinal detachment caused by macular hole in highly myopic eyes. Retina. 2007;27(1):37–44.
Suda K, Hangai M, Yoshimura N. Axial length and outcomes of macular hole surgery assessed by spectral-domain optical coherence tomography. Am J Ophthalmol. 2011;151(1).
Nadal J, Verdaguer P, Canut MI. Treatment of retinal detachment secondary to macular hole in high myopia: vitrectomy with dissection of the inner limiting membrane to the edge of the staphyloma and long-term tamponade. Retina. 2012;32(8).
Arias L, Caminal JM, Rubio MJ, Cobos E, Garcia-Bru P, Filloy A et al. Autofluorescence and axial length as prognostic factors for outcomes of macular hole retinal detachment surgery in high myopia. Retina. 2015;35(3).
Jo Y, Ikuno Y, Nishida K. Retinoschisis: a predictive factor in vitrectomy for macular holes without retinal detachment in highly myopic eyes. Br J Ophthalmol. 2012;96(2).
Sun CB, Liu Z, Xue AQ, Yao K. Natural evolution from macular retinoschisis to full-thickness macular hole in highly myopic eyes. Eye. 2010;24(12).
Mateo C, Burés-Jelstrup A. Macular buckling with Ando plombe may increase choroidal thickness and mimic serous retinal detachment seen in the tilted disk syndrome. Retin Cases Brief Rep. 2016 Fall;10(4):327–30.
We would like to express our sincere gratitude to Dr. Barbara Parolini for her invaluable contributions to the field of macular buckling surgery. Her pioneering work in describing the staging system for myopic tractional maculopathy and the surgical techniques for macular buckling has been instrumental in the execution and development of our study.
The authors received no financial support for the research, authorship, and/or publication of this article.
Authors and affiliations.
Department of Ophthalmology, Ribeirão Preto Medical School, University of São Paulo, 3900, Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil
Francyne Veiga Reis Cyrino, Moisés Moura de Lucena, Letícia de Oliveira Audi, José Afonso Ribeiro Ramos Filho, João Pedro Romero Braga, Thais Marino de Azeredo Bastos, Igor Neves Coelho & Rodrigo Jorge
F.C., J.R.F., and R.J. were primarily responsible for the research design. F.C., J.R.F., J.B., T.B., and I.C. were responsible for data acquisition. M.L., L.A., and I.C. performed the data analysis and drafted the initial manuscript. F.C., J.R.F. and R.J. provided critical revisions and contributed to the refinement of the manuscript. All authors reviewed and approved the final version of the manuscript.
Correspondence to Francyne Veiga Reis Cyrino .
Ethics approval and consent to participate.
The institutional review board and ethics committee of the Division of Ophthalmology, Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil, approved this study (CAAE: 79706624.6.0000.5440).
Consent for publication.
Not applicable.
Competing interests.
The authors declare no competing interests.
Cite this article.
Cyrino, F.V.R., de Lucena, M.M., de Oliveira Audi, L. et al. Historical and practical aspects of macular buckle surgery in the treatment of myopic tractional maculopathy: case series and literature review. Int J Retin Vitr 10, 60 (2024). https://doi.org/10.1186/s40942-024-00578-w
Download citation
Received: 04 June 2024
Accepted: 20 August 2024
Published: 28 August 2024
DOI: https://doi.org/10.1186/s40942-024-00578-w