Maybe the most interactive data analysis programming esProc SPL

Data analysis programming?

SQL ?

esProc SPL !

Python ?

Looks simple, actually hard

e.g.Simple query

select id, name from T where id = 1

select area,sum(amount) from T group by area

e.g.Calculate the longest consecutive rising days for each stock

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT, 
                SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT, 
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT)  THEN 0
                    ELSE 1 END AS updown_flag
            FROM stock
        )    
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

e.g.A player who scores three times within one minute

e.g.Number of users active for three consecutive days within every 7-day period

e.g.Next-day retention rate for new users

e.g.Price increase on the day when the stock price is higher than the price 5 days before and 5 days after

e.g....

Debugging is cumbersome

SQL lacks common debugging methods such as setting breakpoints and stepping into code; when nested in multiple layers, it needs to be split and executed layer by layer, which becomes very troublesome.

SELECT CODE, MAX(con_rise) AS longest_up_days

FROM (

SELECT CODE, COUNT(*) AS con_rise

FROM (

SELECT CODE, DT,

SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days

FROM (

SELECT CODE, DT,

CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0

ELSE 1 END AS updown_flag

FROM stock

)

GROUP BY CODE, no_up_days

)

GROUP BY CODE

Heavy and closed, with a complex environment; external data must first be imported, with many tedious tasks involved.

6 Major Advantage, esProc SPL helps data analysts.

1More Concise Code

e.g.Longest consecutive rising days of each stock

SPL is simpler, eliminating any loop statements.

SQL

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT, 
                SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT, 
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT)  THEN 0
                    ELSE 1 END AS updown_flag
            FROM stock
        )    
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

SPL

	A
1	=stock.sort(StockRecords.txt)
2	=T(A1).sort(DT)
3	=A2.group(CODE;~.group@i(CL< CL[-1]).max(~.len()):max_increase_days)

Python

import pandas as pd
stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file,sep="\t")
stock_info.sort_values(by=['CODE','DT'],inplace=True)
stock_group = stock_info.groupby(by='CODE')
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() – 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])

2Strong Interactivity Suitable for Multi-Step Exploratory Analysis and Convenient Debugging

e.g.Find the periods of stock price increases lasting for more than 5 consecutive days.

3Built-in parallel out-of-memory calculations.

In-memory

	A
1	smallData.txt
2	=file(A1).import@t()
3	=A2.groups(state;sum(amount):amount)

Out-of-memory

	A
1	bigData.txt
2	=file(A1).cursor@t()
3	=A2.groups(state;sum(amount):amount)

Serial

	A
1	bigData.txt
2	=file(A1).cursor@t()
3	=A2.groups(state;sum(amount):amount)

Parallel

	A
1	bigData.txt
2	=file(A1).cursor@tm()
3	=A2.groups(state;sum(amount):amount)

4Lightweight portability

Runs smoothly on a regular laptop without the need for a server cluster.

Stores data in highly compressed files, making it easy to carry.

5Openness, directly calculate diverse data sources

6Pure Java Enables direct Integration of Explored Results into Enterprise Applications

Python surpasses SQL, still difficult to write in complex situations

e.g.Longest consecutive rising days of each stock

import pandas as pd
stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file,sep="\t")
stock_info.sort_values(by=['CODE','DT'],inplace=True)
stock_group = stock_info.groupby(by='CODE')
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() – 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])

Debugging still relies on print method

No big data support, Pseudo parallelism

Non-Java System Often Requires Redevelopment When Integrating Explored Results into Enterprise Application

e-book "SPL Programming"

Download esProc SPL

Forum