DBA > Articles

High Concurrency with Python and Oracle Database

By: Yuli Vasiliev
To read more DBA articles, visit http://dba.fyicenter.com/article/

With the trend toward more, rather than faster, cores, exploiting concurrency is increasing in importance. Concurrency creates a new paradigm shift in programming, making it possible to write asynchronous code that separates tasks into a set of threads or processes working in parallel. If you are not new to programming and are somewhat familiar with C or C++, you probably already have some idea of threads and processes and know how they differ from each other: threads provide a lightweight alternative to processes when it comes to concurrent programming, which makes multithreading preferable to multiprocessing in most cases. This article therefore discusses the multithreaded approach to concurrency.

As with many other programming languages, separating CPU-intensive tasks into multiple processes in Python (this can be done using the multiprocessing module of the Python standard librarymay give you some performance benefits when utilizing a multiple-CPU machine. It can really run multiple operations in parallel, as opposed to just switching between tasks while performing only one task at any given time, which is true for single-processor machines. In contrast, you may not notice any performance improvement when moving a multithreaded Python program to a multiple-CPU machine, because of the global interpreter lock (GIL), which is used by Python to protect internal data structures, ensuring that only one thread at a time runs the CPython virtual machine.

However, you may still be interested in adding threads to a database-backed Python program to speed it up. The point is that the underlying database your Python program interacts with is most likely installed on a high-performance server that processes submitted queries in parallel. This means that you can benefit from submitting several queries to the database server in separate threads in parallel, rather than issuing them sequentially, one after another, in a single thread.

One word of warning: although utilizing the natural parallelism of tasks may significantly improve application performance, it’s important to realize that not all tasks can be performed in parallel. For example, you cannot issue confirmation e-mails to customers until the operation they requested—say, a money transfer—is complete. It’s fairly obvious that such tasks must be performed sequentially in a certain order.

Another important thing to keep in mind when building multithreaded code is that some threads running in parallel may try to change shared objects at the same time, which may lead to loss of data, data malformation, or even corruption of the object being changed. To avoid this problem, it would be helpful to control access to shared objects so that only one thread can use such an object at once. Fortunately, Python lets you implement a locking mechanism to synchronize access to shared objects utilizing locking tools available in the thread module.

The downside to using locks is that it hurts scalability. When designing for scalability, it’s important to remember that putting a lock on a resource within one thread makes that resource unavailable in all the other running threads and processes until the lock is released. Thus, to ensure efficient resource management, you should not overuse locks, avoiding them whenever possible or releasing them as soon as possible when they are required.

Luckily, you don’t need to worry about locking when you’re dealing with resources stored in an Oracle database. This is because Oracle Database uses its own locking mechanisms behind the scenes when it comes to providing access to shared data in concurrent environments. So it’s often a good idea to keep shared data in the database, thus letting Oracle Database take care of the concurrency problems.

Another good way to achieve scalability and benefit from concurrency is to perform operations asynchronously. In asynchronous programming, the blocking code is queued for completion later by a separate thread, enabling your application to proceed with other tasks. Using an asynchronous framework such as Twisted may greatly simplify the task of building asynchronous applications.

This article provides a brief introduction to building concurrent applications with Python and Oracle Database, describing how to utilize threads in Python code interacting with an Oracle database and explaining how to submit SQL queries to the database server in parallel instead of processing them sequentially. You will also learn how to make Oracle Database take care of concurrency problems as well as how to take advantage of Twisted, a Python event-driven framework.

Multithreaded Programming in Python
Threads are a very useful feature when it comes to parallel processing. If you have a program that is performing time-consuming operations and can divide it into several independent tasks to be performed in parallel, using threads can help you build more-efficient, faster code. Another interesting use of multithreading can be to improve the responsiveness of your application—the main program remains responsive while time-consuming operations are performed in the background.

Enclosing long-running SQL statements in separate threads in Python can be a good idea when these statements do not depend on each other and can be executed in parallel. For example, you might significantly reduce the loading time of a Web page if it is submitting initial SQL queries to the database server in parallel instead of processing them sequentially, thus making them wait for one another.

Another situation in which you might find threads useful is if you have to upload some large objects (LOBs) into the database. Doing this in parallel can not only reduce the overall time it takes to upload your LOBs into the database but can also keep the main thread of the program responsive while parallel uploading is happening in the background.

Suppose you need to upload a couple of binary large objects (BLOBs) to the database and save them to the blob_tab table that you might have created in a custom database schema, as follows:

CREATE TABLE blob_tab(
id NUMBER PRIMARY KEY,
blobdoc BLOB
);

CREATE SEQUENCE blob_seq;


First, let’s look at how you might store BLOBs into the blob_tab table one after another without utilizing threads. The following is the Python script that serves this purpose, persisting two input images obtained by use of the filename and URL, respectively. The example assumes that you have created a blob_tab table and a blob_seq sequence in a usr/pswd custom database schema:

#File: singlethread.py
#Storing BLOBs in a single thread sequentially, one after another

import cx_Oracle
from urllib import urlopen

inputs = []
#if you’re a Windows user, the path could be 'c:/temp/figure1.bmp'
inputs.append(open('/tmp/figure1.bmp', 'rb'))
inputs.append(urlopen('http://localhost/mypictures/figure2.bmp', 'rb'))
#obtaining a connection and predefining a memory area for a BLOB
dbconn = cx_Oracle.connect('usr', 'pswd', '127.0.0.1/XE')
dbconn.autocommit = True
cur = dbconn.cursor()
cur.setinputsizes(blobdoc=cx_Oracle.BLOB)
#executing INSERT statements saving BLOBs to the database
for input in inputs:
blobdoc = input.read()
cur.execute("INSERT INTO blob_tab (ID, BLOBDOC) VALUES(blob_seq.NEXTVAL, :blobdoc)", {'blobdoc':blobdoc})
input.close()
dbconn.close()


Although the task of obtaining and storing figure1.bmp and the similar task for figure2.bmp are going on one after another here, these tasks, as you might guess, are not actually sequentially dependent. So you might refactor the above code so that it reads and stores each image within a separate thread, thus improving performance through parallel processing. It’s important to note that in this particular case, you won’t need to coordinate the threads running in parallel, which significantly simplifies coding.

The following example shows how you might rewrite the above script to use threads, utilizing an object-oriented approach. In particular, it illustrates how you can extend the Thread class from the threading module, customizing it for a particular task.

#File: multithread.py
#Storing BLOBs in separate threads in parallel

import cx_Oracle
import threading
from urllib import urlopen

#subclass of threading.Thread
class AsyncBlobInsert(threading.Thread):
def __init__(self, cur, input):
threading.Thread.__init__(self)
self.cur = cur
self.input = input
def run(self):
blobdoc = self.input.read()
self.cur.execute("INSERT INTO blob_tab (ID, BLOBDOC) VALUES(blob_seq.NEXTVAL, :blobdoc)", {'blobdoc':blobdoc})
self.input.close()
self.cur.close()
#main thread starts here
inputs = []
inputs.append(open('/tmp/figure1.bmp', 'rb'))
inputs.append(urlopen('http://localhost/_figure2.bmp', 'rb'))
dbconn = cx_Oracle.connect('usr', 'pswd', '127.0.0.1/XE',threaded=True)
dbconn.autocommit = True
for input in inputs:
cur = dbconn.cursor()
cur.setinputsizes(blobdoc=cx_Oracle.BLOB)
th = AsyncBlobInsert(cur, input)
th.start()

In the above code, note the use of the threaded attribute passed to the cx_Oracle.connect method as a parameter. By setting it to true, you instruct Oracle Database to use the OCI_THREADED mode, also known as the threaded mode, thus specifying that the application is running in a multithreaded environment. It’s interesting to note here that using the threaded mode with a single-threaded application would not be a good idea. According to the cx_Oracle documentation, turning the threaded parameter to true in a single-threaded application may impose a performance penalty of 10 percent to 15 percent.

In this example, you share a single connection between two threads, creating a separate cursor object for each thread, though. The operation of reading and then inserting a BLOB into the database is implemented here within the overridden run method of the AsyncBlobInsert custom subclass of the threading.Thread standard Python class. So all you need to do to start uploading a BLOB in a separate thread is to create an AsyncBlobInsert instance and then call its start method.

There is one problem with the script discussed here. When executed, it will not wait until the threads being launched are completed—the main thread will finish after launching the child threads, not waiting for their completion. If this is not the desired behavior and you want the program to complete only when all the threads have been completed, you might call the join method of each AsyncBlobInsert instance at the end of the script. This will block the main thread, making it wait for completion of the child threads. Here is how you might modify the preceding script so that it waits for all the threads launched in the for loop to finish:

Full article...


Other Related Articles

... to read more DBA articles, visit http://dba.fyicenter.com/article/