Who is the best writer for PL/SQL?

Jürgen Sieben, November 3rd, 2015

This article is another in a small series about passages that raised my objections while reading specialist literature. Of course, my aim is not to point the finger at other authors (one part of this series even covers a passage from my own PL/SQL book); rather, these passages serve as an anchor for explaining a topic that I think may interest many readers. This time it is a PL/SQL recipe and the question: when should PL/SQL not be used?

The reference

In a cookbook for PL/SQL code I found a very nice PL/SQL procedure for the problem of deleting duplicate entries from a table. The procedure reads like this:

declare
  cursor emp_cur is
    select *
      from employees
     order by employee_id;
  emp_count number := 0;
  total_count number := 0;
begin
  for emp in emp_cur loop
    select count(*)
      into emp_count
      from employees
     where employee_id = emp.employee_id;
    if emp_count > 1 then
      total_count := total_count + 1;
      -- do_something
    end if;
  end loop;
end;

Wonderful!

The problem

The table EMPLOYEES belongs to the schema HR and contains 107 rows. That is nothing. So let's run a more realistic simulation and use the rows of the view ALL_OBJECTS, with roughly 100,000 rows, for our examples. First we create a table and copy some data into it. The database objects of the user SH I will simply import twice, so that I get violations of a primary key that is still to be created:

SQL> create table test_table (
  2    owner varchar2(30),
  3    object_id number,
  4    object_name varchar2(30),
  5    object_type varchar2(30));

Table created.

SQL> insert into test_table (owner, object_id, object_name, object_type)
  2  select owner, object_id, object_name, object_type
  3    from all_objects
  4  union all
  5  select owner, object_id, object_name, object_type
  6    from all_objects
  7   where owner = 'SH';

92553 rows created.

Then we port the recipe to our data:

SQL> create or replace procedure remove_duplicates
  2  as
  3    cursor object_cur is
  4      select *
  5        from test_table
  6       order by object_id;
  7    l_obj_count number := 0;
  8    l_total_count number := 0;
  9  begin
 10    for obj in object_cur loop
 11      select count(*)
 12        into l_obj_count
 13        from test_table
 14       where object_id = obj.object_id;
 15      if l_obj_count > 1 then
 16        dbms_output.put_line(
 17          obj.object_type || ' ' ||
 18          obj.object_name || ' exists twice.');
 19      end if;
 20    end loop;
 21  end remove_duplicates;
 22  /

Procedure created.

Since I was about to take a break anyway, I called the procedure:

SQL> set serveroutput on
SQL> call remove_duplicates();

TABLE COSTS exists twice.
TABLE COSTS exists twice.
TABLE PARTITION COSTS exists twice.
TABLE PARTITION COSTS exists twice.
TABLE PARTITION COSTS exists twice.
TABLE PARTITION COSTS exists twice.
TABLE PARTITION COSTS exists twice.
...

Call completed.
Elapsed: 00:03:14.51

Obviously, this code has a number of problems. One is the runtime; another is the fact that every duplicate is printed twice. Let's start by analyzing how the duplicates are found. We iterate over all rows of TEST_TABLE and, for each row, run a further query against TEST_TABLE to count how many rows exist for that ID. To top it all off, the outer cursor also sorts TEST_TABLE, although it is not at all clear what that is for. So the outer cursor reads around 95,000 rows and sorts them in the working memory of the session, only to then run a separate query against the same table for every single row. If that query finds a primary-key value more than once, a duplicate has been found and is printed, once for each of its occurrences, which is why every duplicate appears twice in the output.

Another problem with this procedure is that the inner select queries are not guaranteed to be read-consistent with the outer query. A select query does not lock any data, and as long as the session's isolation level is not SERIALIZABLE, each select sees the table as it was at the moment that particular query started. Since the whole run takes over three minutes, the last select queries the table more than three minutes after the outer cursor read its data, which may have changed in the meantime. On top of that, around 100,000 separate select queries are sent to the database.
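For completeness: if one really wanted a read-consistent view across the whole run, the session could be switched to SERIALIZABLE before the call. This is only a sketch of the Oracle syntax, and it does not cure the recipe's real ailment, the row-by-row processing:

```sql
-- Must be the first statement of the transaction (Oracle syntax).
set transaction isolation level serializable;

-- All queries in this transaction now see the data as of the
-- transaction's start, not as of each individual statement's start.
call remove_duplicates();

commit;
```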

This approach breathes a procedural spirit. That is nothing disgraceful in itself, but in a database it is simply wrong. The mantra is:

If you have a problem, solve it in SQL.
If that doesn't work at all, solve it in PL/SQL.
If that doesn't work at all, solve it in Java.
If that doesn't work at all, solve it in C.
If that doesn't work at all - think about what kind of problem you have there.

The important thing about this mantra is the order of the programming languages to be used, and the problem is that the recipe above violates it at the very first step.

Possible solutions

Let's try to rescue the code by addressing the problem of read consistency and the flood of select statements. In terms of content, rows that have no duplicates should be ignored; we only want to see the rows that occur more than once. That sounds very much like the following select statement:

SQL> select object_id, object_type, object_name, count(*) cnt
  2    from test_table
  3   group by object_id, object_type, object_name
  4  having count(*) > 1;

 OBJECT_ID OBJECT_TYPE               OBJECT_NAME                      CNT
---------- ------------------------- ------------------------- ----------
     92618 TABLE PARTITION           COSTS                              2
     92613 TABLE PARTITION           COSTS                              2
     92660 TABLE PARTITION           SALES                              2
     92694 TABLE                     TIMES                              2
     92850 INDEX                     CHANNELS_PK                        2
     92891 INDEX PARTITION           COSTS_TIME_BIX                     2
...
310 rows selected.

Elapsed: 00:00:00.48

An obvious first improvement (note the elapsed time). This query would be well suited for the cursor of the procedure, especially since it makes the inner query entirely unnecessary:

SQL> create or replace procedure remove_duplicates
  2  as
  3    cursor object_cur is
  4      select object_id, object_type, object_name, count(*) cnt
  5        from test_table
  6       group by object_id, object_type, object_name
  7      having count(*) > 1;
  8  begin
  9    for obj in object_cur loop
 10      dbms_output.put_line(
 11        obj.object_type || ' ' ||
 12        obj.object_name || ' exists twice.');
 13    end loop;
 14  end remove_duplicates;
 15  /

Procedure created.

SQL> call remove_duplicates();
TABLE PARTITION COSTS exists twice.
TABLE PARTITION COSTS exists twice.
TABLE PARTITION SALES exists twice.
TABLE TIMES exists twice.
...
Call completed.

Elapsed: 00:00:00.45

But the question is what we actually want to achieve. Did we just want to write the result to the screen? Then we could replace the call to dbms_output.put_line() with string concatenation directly in the select query. More likely, though, we simply want to get rid of the duplicates. The options available here depend on the scenario; let's go through a few.
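If screen output were really all we wanted, the message could indeed be built by string concatenation in pure SQL; a sketch against our TEST_TABLE:

```sql
select object_type || ' ' || object_name || ' exists twice.' as message
  from test_table
 group by object_id, object_type, object_name
having count(*) > 1;
```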

We imported data from a CSV

Then the likely scenario would be that we want to copy the good data into a target table and discard the bad ones. This, too, can be done quite well without PL/SQL:

SQL> create table target_table
  2  as select *
  3       from test_table
  4      where null is not null;

Table created.

SQL> alter table target_table add constraint pk_target_table
  2  primary key (object_id);

Table altered.

SQL> insert into target_table (owner, object_id, object_name, object_type)
  2  select owner, object_id, object_name, object_type
  3    from test_table
  4   where rowid in (
  5         select row_id
  6           from (select rowid row_id,
  7                        rank() over (
  8                          partition by object_id order by rowid) rnk
  9                   from test_table)
 10          where rnk = 1);

92243 rows created.
Elapsed: 00:00:00.17

The insert statement copies only the data without the duplicates. The solution uses the analytic function rank(), which assigns a sequence number within a grouping criterion according to a sort criterion. Since the duplicated rows are, well, identical, we need some criterion to decide between the two; the ROWID, for example, lends itself here. In the outer query all duplicates are filtered out because they have a rank greater than 1; the list of ROWIDs of the remaining rows then identifies the rows to be inserted. The problem with a PL/SQL solution would be that the "good" rows would have to be transferred into the target table row by row by procedural logic, which this statement makes unnecessary.

We want to clean up data to set up a primary key

A delete statement directly against the table is a bit more complex, because a decision has to be made as to which of two rows with the same values should be deleted.
This solution, too, can be implemented without PL/SQL:

SQL> delete from test_table
  2   where rowid in (
  3         select row_id
  4           from (select rowid row_id,
  5                        rank() over (
  6                          partition by object_id order by rowid) rnk
  7                   from test_table)
  8          where rnk > 1);

310 rows deleted.
Elapsed: 00:00:00.06

Please note the execution time here as well. The solution again uses the analytic function rank(). In the outer query only the duplicates are filtered, because they have a rank greater than 1; the list of their ROWIDs is then used to clean up the table.
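As an aside, the same cleanup is often written without analytic functions by keeping, for each object_id, only the row with the smallest ROWID. This variant is not from the recipe, just a common alternative:

```sql
-- Delete every row whose ROWID is not the smallest one for its object_id.
delete from test_table
 where rowid not in (
       select min(rowid)
         from test_table
        group by object_id);
```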

The good ones into the pot, the bad ones into the crop

If the goal is to separate the two sets of data, several options are available as well. On the one hand, you could give the insert statement into the target table a log errors clause, which writes any offending row to an error table in the event of a primary-key violation:

SQL> call dbms_errlog.create_error_log('TARGET_TABLE', 'TARGET_ERR');

Call completed.

SQL> insert into target_table
  2  select *
  3    from test_table
  4  log errors into target_err
  5  reject limit unlimited;

92243 rows created.

Elapsed: 00:00:00.23

SQL> select count(*)
  2    from target_err;

  COUNT(*)
----------
       310

Alternatively, the route could go through a multi-table insert statement. In this variant, a corresponding select supplies the data together with a rank, and when inserting, a case distinction is made on that rank:

SQL> insert first
  2    when rnk = 1 then
  3      into target_table (owner, object_id, object_name, object_type)
  4      values (owner, object_id, object_name, object_type)
  5    else
  6      into target_err (owner, object_id, object_name, object_type)
  7      values (owner, object_id, object_name, object_type)
  8  select t.*, rank() over (partition by object_id order by rowid) rnk
  9    from test_table t;

92553 rows created.

Elapsed: 00:00:00.14

You have already seen this preparation of the data; the only difference is that the duplicates are not filtered out here, since they are needed for the case distinction.

Summary

The main problem when programming databases is to get in the database's way as little as possible. Yet that is exactly what often happens to experienced application programmers: they are too accustomed to the procedural approach to even look for a set-oriented one. SQL has gone to great lengths to make procedural programming unnecessary and to solve standard problems directly in SQL.

If any of the techniques used here are unfamiliar to you (the log errors clause, multi-table inserts, or analytic functions, for example), please take this as an invitation to grapple with SQL once again. The solutions are usually shorter, faster, easier to maintain, and do not require extensive unit tests. Reasons enough to leave PL/SQL in the closet for once.