- 11 photos, 5,000,000 objects per photo
- Celestial bodies at a close astronomical distance (computed with trigonometric functions) are regarded as the same object
- Complexity: 5,000,000 * 5,000,000 * 10 = 250 trillion comparisons
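As a back-of-the-envelope check, the comparison count can be verified in a few lines of Python (assuming a naive all-pairs match between photos, with the factor of 10 taken as comparing one photo against the other 10):

```python
# Naive cross-matching cost: every object in one photo is compared
# with every object in another photo, and the result is multiplied
# by 10 photo pairings (an assumption about the original figure).
objects_per_photo = 5_000_000
pairings = 10

comparisons = objects_per_photo * objects_per_photo  # per photo pair
total_ops = comparisons * pairings

print(comparisons)  # 25 trillion comparisons per photo pair
print(total_ops)    # 250 trillion comparisons in total
```

Each comparison additionally involves trigonometric distance calculations, which is why the workload is so heavy.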
For data computing scenarios: Offline Batch Job, Online Query/Report
Databases that use SQL syntax and are applied to OLAP scenarios
Other technologies for structured data analysis and statistics
The central data warehouse undertakes all data tasks of the whole bank; it is overburdened and can only assign 5 concurrent connections to the BI system
Even for a small amount of high-frequency data, DB2 cannot deliver real-time queries; it also cannot route data automatically, so users must select the data source themselves
esProc stores a small amount of high-frequency data; the large volume of low-frequency data remains in the data warehouse, avoiding repeated construction
esProc takes over most high-frequency computing tasks, and the few low-frequency tasks are automatically routed to the central data warehouse
SELECT MAX(ContinuousDays) FROM (
    SELECT COUNT(*) ContinuousDays FROM (
        SELECT SUM(UpDownTag) OVER (ORDER BY TradeDate) NoRisingDays FROM (
            SELECT TradeDate,
                   CASE WHEN Price > LAG(Price) OVER (ORDER BY TradeDate)
                        THEN 0 ELSE 1 END UpDownTag
            FROM Stock
        )
    )
    GROUP BY NoRisingDays
)
SQL does not support ordered operations sufficiently and provides no direct ordered grouping; instead, four layers of nesting have to be used in a roundabout way.
Such statements are not only difficult to write, but also difficult to understand.
In the face of complex business logic, the complexity of SQL increases sharply, making it hard to write and understand.
This is not an unusual requirement; it appears everywhere in the thousands of lines of SQL found in real systems, and it severely reduces development and maintenance efficiency.
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
This query uses ORDER BY. If executed strictly according to this logic, the full data set has to be sorted, and performance will be poor.
We know this operation can be performed without a full sort, but SQL cannot describe that algorithm; we can only rely on the database's optimization engine.
In simple cases (such as this statement), many databases can make the optimization, but in more complex situations the optimization engine gets lost
In the following example, which gets the TopN from each group, SQL cannot describe the operation directly and has to resort to a subquery with a window function.
Faced with this roundabout formulation, the database optimization engine cannot optimize and has to perform the sort.
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders
) WHERE rn <= 10
A | |
1 | =Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len()) |
The computing logic of this SPL is the same as that of the previous SQL, but SPL provides an ordered grouping operation, making the code intuitive and concise.
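The same "start a new group where the order breaks" idea can be sketched in plain Python (a hypothetical price list; `itertools.groupby` plays the role of SPL's `group@i`, starting a new run whenever the price falls):

```python
from itertools import groupby

def longest_non_falling_run(prices):
    # Tag each day with a run id: a new run starts whenever the
    # price drops below the previous day's price.
    run_id = 0
    tags = []
    for i, p in enumerate(prices):
        if i > 0 and p < prices[i - 1]:
            run_id += 1
        tags.append(run_id)
    # The longest run is the largest group of consecutive equal tags.
    return max(len(list(g)) for _, g in groupby(tags))

prices = [10, 11, 11, 9, 12, 13, 14, 8]   # hypothetical closing prices
print(longest_non_falling_run(prices))    # 4  (the run 9, 12, 13, 14)
```

One linear pass and one grouping step, with no nesting: this mirrors the structure of the SPL expression rather than the four-layer SQL.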
A | ||
1 | =file("Orders.ctx").open().cursor() | |
2 | =A1.groups(;top(10;-Amount)) | Top 10 orders |
3 | =A1.groups(Area;top(10;-Amount)) | Top 10 orders of each area |
SPL regards TopN as an aggregation that returns a set, avoiding a full sort; the syntax is similar for the whole set and for groups, so no roundabout approach is needed.
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date-14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date-14 AND e2.etime < end_date
      AND e2.etime > t1 AND e2.etime < t1+7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date-14 AND e3.etime < end_date
      AND e3.etime > t2 AND e3.etime < t1+7 AND etype = 'etype3'
    GROUP BY 1)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1
LEFT JOIN e2 ON e1.uid = e2.uid
LEFT JOIN e3 ON e2.uid = e3.uid
SQL lacks order-related calculations and is not completely set-oriented; it has to detour into multiple subqueries and repeated JOINs, which is hard to write and understand and performs poorly.
Due to space limitations, only a three-step funnel is listed here; each additional step requires another subquery.
A | |
1 | =["etype1","etype2","etype3"] |
2 | =file("event.ctx").open() |
3 | =A2.cursor(id,etime,etype;etime>=end_date-14 && etime < end_date && A1.contain(etype) ) |
4 | =A3.group(uid) |
5 | =A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first) |
6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime < t1+7).etime, null)))) |
7 | =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3) |
SPL provides order-related calculations and is more thoroughly set-oriented. The code follows natural thinking directly, and is simple and efficient.
This code handles funnels with any number of steps; only the step list in A1 needs to change.
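The same per-user funnel logic can be sketched in Python (hypothetical event records with integer day numbers; this is a simplification of the logic above: each step must occur after the previous step and within 7 days of step 1):

```python
from collections import defaultdict

def funnel(events, steps, window_days=7):
    # events: list of (uid, day, etype) tuples, day as an integer.
    # Returns, for each step, how many users completed it in order,
    # with every later step within window_days of the first step.
    by_user = defaultdict(list)
    for uid, day, etype in events:
        by_user[uid].append((day, etype))
    counts = [0] * len(steps)
    for seq in by_user.values():
        seq.sort()                      # process each user's events in time order
        t_prev = t_first = None
        for i, step in enumerate(steps):
            # Earliest occurrence of this step satisfying the constraints.
            t = next((d for d, e in seq
                      if e == step and
                      (i == 0 or (d > t_prev and d < t_first + window_days))),
                     None)
            if t is None:
                break                   # user drops out of the funnel here
            if i == 0:
                t_first = t
            t_prev = t
            counts[i] += 1
    return counts

events = [(1, 1, "view"), (1, 2, "cart"), (1, 3, "pay"),
          (2, 1, "view"), (2, 9, "cart"),   # cart outside the 7-day window
          (3, 1, "view")]
print(funnel(events, ["view", "cart", "pay"]))   # [3, 1, 1]
```

As with the SPL version, adding a funnel step means only extending the `steps` list; no new subqueries or joins appear.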
The difficulties of SQL stem from relational algebra; theoretical problems cannot be solved by engineering methods. Despite years of improvement, SQL still struggles to meet complex requirements.
SPL is based on a completely different theoretical system, the discrete dataset. It provides richer data types and basic operations, and has more powerful expressive capability.
Java is too low-level, lacking the necessary data types and computing libraries, which makes such code difficult or even impossible for application programmers to write.
// Group orders by year + seller and sum the amounts
Calendar cal = Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c = Orders.collect(
    Collectors.groupingBy(
        r -> {
            cal.setTime(r.OrderDate);
            return cal.get(Calendar.YEAR) + "_" + r.SellerId;
        },
        Collectors.summarizingDouble(r -> r.Amount)
    )
);
for (Object sellerid : c.keySet()) {
    DoubleSummaryStatistics r = c.get(sellerid);
    String[] year_sellerid = ((String) sellerid).split("_");
    System.out.println("group is (year):" + year_sellerid[0]
        + "\t (sellerid):" + year_sellerid[1]
        + "\t sum is:" + r.getSum());
}
High performance algorithms are difficult to implement
No universal high-performance storage
Python's DataFrame is not good at processing structured data computation in complex situations.
import pandas as pd

def salary_diff(g):
    max_age = g['BIRTHDAY'].idxmin()
    min_age = g['BIRTHDAY'].idxmax()
    diff = g.loc[max_age]['SALARY'] - g.loc[min_age]['SALARY']
    return diff

emp['BIRTHDAY'] = pd.to_datetime(emp['BIRTHDAY'])
salary_diff = emp.groupby('DEPT').apply(salary_diff)
print(salary_diff)
There are still shortcomings in terms of computational completeness
Poor syntax consistency
Poor big data capabilities
No universal high-performance storage
No server required: esProc JDBC has independent computing power and can be embedded in an application to compute
A highly distinctive feature
Server mode: independent deployment, cluster support, with load balancing and fault tolerance mechanisms
All data sources are logically equivalent; esProc does not own data (it only computes), naturally implementing the separation of storage and computation
For data, there is no concept of “inside” or “outside” esProc, and no action of “importing into” or “exporting out of” esProc
esProc has independent and complete computing power; it does not depend on its data sources and can read any of them and compute on them in combination.
All computation is implemented within esProc; it will not, and cannot, be translated into SQL
File storage, used efficiently, fully guarantees computing performance
Binary file: simple format, no need to define a data structure
Mixed row-wise and columnar storage: supports indexing, requires a pre-defined structure
esProc is based on a brand-new computing model; no open-source technology could be reused, and everything from theory to code is independent innovation.
SPL's high performance rests on this innovative theory and cannot be achieved with SQL, since SQL cannot describe most low-complexity algorithms.
esProc is implemented in pure Java.
esProc can run smoothly under any OS equipped with a JVM, including VMs, cloud servers and even containers.
esProc provides a standard JDBC driver for Java applications.
esProc can be integrated in a Java application seamlessly.
esProc can be invoked by non-Java applications via HTTP/RESTful APIs.
As a Java product with good integration, esProc can be seamlessly embedded in various Java frameworks and application servers; its logical status is equivalent to hand-written Java code
For computing frameworks (such as Spark), although esProc can be seamlessly integrated, doing so has no practical significance; esProc can replace Spark for computation
Specifically, esProc has its own streaming computing ability and does not need to be embedded in streaming frameworks (such as Flink); it typically delivers better functionality and performance
Yes, of course! esProc supports almost all common data sources in the industry and works directly through the data sources' own interfaces and syntax, without mapping them to relational tables.
However, esProc cannot guarantee high performance in this situation, owing to the inefficient I/O of databases, and databases can hardly provide the storage schemes that low-complexity algorithms require. For high performance, it is recommended to store data in esProc's own file formats.
esProc does not own the data and, in principle, is not responsible for data storage; any accessible data can be computed.
In particular, esProc has excellent support for files. Data files can be stored in any file system, including NFS and object storage on the cloud, which naturally implements the separation of computation and storage.
When embedded in applications, reliability is guaranteed by the application
When used independently, load balancing and fault tolerance mechanisms are provided, but a single task may still fail; this is only suitable for small-scale clusters
Does not provide automatic recovery after failure
The elastic computing mechanism of the cloud version avoids currently failed nodes when allocating VMs, achieving a degree of high availability
The provided interface can be used to invoke static functions written in Java to extend functionality
esProc also opens an interface for custom functions, which can be used in SPL after registration
Comparing with RDB:
esProc has no metadata, so most computations begin by accessing the data source, which is a little tedious for very simple operations.
Comparing with Hadoop/MPP:
The cluster functionality of esProc has not had many opportunities to be battle-tested.
Historically, esProc has often shrunk clusters down to a single machine without sacrificing performance.
Comparing with Python:
SPL is developing its AI functions, but for now it is still not even close to Python.
SPL is not a computing engine of the SQL family; it currently supports only simple SQL on small data volumes and does not guarantee performance. In effect, esProc does not support SQL, and it is of course not compatible with any SQL dialect or stored procedures.
In the future, a dual engine supporting SQL will be developed, but it will still be difficult to ensure high performance on big data; the goal is only to make existing SQL code easy to migrate.
Not yet.
The information in a SQL statement is insufficient for optimizing its performance. Frankly, we are not veterans like the RDB vendors at guessing the goal of a SQL statement, so converting SQL to SPL directly would usually result in slower speed.
SPL is dedicated to low code and high performance.
SPL syntax is easy: those with Java/SQL knowledge can get started in just a few hours and become proficient within a few weeks.
“Difficult”: high-performance algorithms are a bit difficult and require learning more algorithmic knowledge;
“Not difficult”: once learned, many high-performance tasks become “routine”.
The first 1-2 scenarios will be implemented by Scudata engineers in collaboration with users.
Most programmers are used to thinking in SQL and are not familiar with high-performance solutions beyond SQL; training on one or two scenarios is enough for them to understand.
Performance optimization routines can be experienced and learned; algorithm design and implementation are not that difficult.
Give a man a fish and you feed him for a day. Teach him how to fish and you feed him for a lifetime!
The processing speed for big data is an order of magnitude higher than that of traditional solutions
Procedural syntax, in line with natural thinking
Rich class libraries
Multi-source mixed computation
Can run independently, or embedded into applications
A single machine can match a cluster, reducing hardware expense
Environment-friendly
Development, hardware, O&M costs reduced by X times