H.-H. Nagel ¹²
¹ Institut für Algorithmen und Kognitive Systeme
Fakultät für Informatik der Universität Karlsruhe (TH)
Postfach 6980, D-76128 Karlsruhe / Germany
² Fraunhofer-Institut für Informations- und Datenverarbeitung (IITB)
Fraunhoferstr. 1, D-76131 Karlsruhe / Germany
Conceptually, a Driver Support System (DSS) for road vehicles should behave like a cooperative human copilot. A DSS should continuously monitor the driver, the vehicle, and its environment in order to inform the driver in time about upcoming major decisions regarding navigation and potentially risky traffic situations. A DSS should relieve the driver of distracting routine activities like tracking the vehicle's position for navigation purposes and warn him about impending threats. In principle, the DSS should be able to drive automatically -- given a suitable situation like stop-and-go traffic on a straight, intersection-free, unidirectional road -- if so desired by the human driver, but still under his final responsibility.
Significant advances have been realized -- for example during the European EUREKA project PROMETHEUS (PROgraMme for a European Traffic with Highest Efficiency and Unprecedented Safety, 1986 -- 1994) -- towards turning qualitative ideas like the ones mentioned in the introductory paragraph into engineering specifications for the design and realization of such a DSS. Similar experiences have been accumulated, too, in the course of research programs which have been initiated since then in Japan and in the USA, for example the Intelligent Vehicles Highway System (IVHS). Although these broad research programs do not concentrate on Computer Vision (CV) in the context of road vehicle surveillance, guidance and safety, the technology-driven exponential rise of computing power at roughly constant -- if not decreasing -- cost, power, and space consumption fuels the exploration of CV in this context.
A fundamental operation of CV for driver support consists in the real-time detection and tracking of images of road and vehicle structures recorded by a video-camera. Since about the mid-nineties, this can be achieved by a standard workstation under carefully controlled conditions. In view of the extremely stringent safety requirements for routine use of such approaches on public roads, it is no surprise that more computing power than currently available in a car will be required in order to provide the necessary robustness. On the other hand, experts expect that CV will be introduced into selected regular road vehicle categories still within the next decade. Although a market for CV-based DSSs has not yet been established, precursors are likely to show up soon. Since a DSS will be a component of cars, buses, and trucks, the automobile industry together with its parts and components suppliers constitutes the primarily relevant industrial sector for this technology.
Obviously, it is still too early to reliably estimate market sizes and growth rates for CV-based DSSs. It is tempting, however, to speculate about the boundary conditions for such a market in the future.
Due to the continuous push towards increased efficiency, safety, and comfort in road vehicles, all kinds of electronic devices have already begun to invade the automobile. It thus appears natural to assume that many of the hooks required for the installation of a DSS -- such as, for example, sensors and medium to high speed digital buses as well as electronically activated actuators (drive-by-wire) -- will already have been introduced into cars independently of a DSS.
An order of magnitude estimate for the DSS market proper might be provided by the following consideration: the cost of a DSS will be proportional to that of a video-equipped Multimedia PC as it is sold today for less than 10 000 DM. On the hypothesis that the price for such equipment will be halved -- even with substantially increased capabilities -- during the next five years due to the booming consumer market, it would correspond to about 10 % of the price of an upper middle class sedan and an even smaller fraction of the price of a bus or truck.
On the assumption that initially less than 1 % of all road vehicles newly produced worldwide per year will be equipped with a DSS at the price tag of 5 000 DM, about 200 000 to 300 000 such systems might be sold annually -- which would constitute a market worth about a billion DM, starting possibly already in five years. In analogy to the introduction of radios, anti-lock braking systems (ABS), or automatic gear shifts into automobiles, it does not appear irresponsible to predict a further halving of the price tag for a DSS over the ten year period following its introduction, accompanied by an annual increase in the fraction of road vehicles equipped with such a system from 1 % to 10 %. Ten years from now, this would yield a rate of 5 % of the annual road vehicle output or 2 million DSSs per year, corresponding to a market of 4-5 billion DM annually worldwide. Even under the assumption of drastically falling prices, this would result in a sustained period with 50 % growth of the DSS market annually after its introduction. Since this market constitutes an add-on market -- i. e. it does not require installing a completely new sales and maintenance organisation from scratch, but may gradually evolve from current vehicle electronics -- the figures do not appear outright unrealistic.
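The arithmetic behind these estimates can be retraced in a few lines. The following sketch is only an illustration: the function names are hypothetical, the annual world production of about 40 million vehicles is inferred from the statement that 5 % corresponds to 2 million systems, and reading the 50 % growth figure as growth in units sold (from about 250 000 to 2 million over five years) is an assumption rather than something stated in the text.

```python
def market_volume_dm(units_per_year, price_dm):
    """Annual market volume in DM for a given number of units and unit price."""
    return units_per_year * price_dm


def compound_annual_growth(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` after `years`."""
    return (end / start) ** (1.0 / years) - 1.0


# Figures taken from the text.
INITIAL_UNITS = (200_000, 300_000)   # systems sold per year at introduction
INITIAL_PRICE_DM = 5_000
LATER_UNITS = 2_000_000              # stated to equal 5 % of annual output
LATER_SHARE = 0.05

# Inferred, not stated in the text: annual world vehicle production.
implied_production = LATER_UNITS / LATER_SHARE

initial_market = tuple(
    market_volume_dm(units, INITIAL_PRICE_DM) for units in INITIAL_UNITS
)

# Assumption: growth is measured in units, from the midpoint of the initial
# range (250 000) to 2 million units over five years.
unit_growth = compound_annual_growth(250_000, LATER_UNITS, 5)

if __name__ == "__main__":
    print(f"implied production: {implied_production:,.0f} vehicles per year")
    print(f"initial market: {initial_market[0]:,} to {initial_market[1]:,} DM")
    print(f"unit growth: {unit_growth:.1%} per year")
```

Under these assumptions the initial market comes to between 1.0 and 1.5 billion DM, and unit sales grow by roughly 50 % per year.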
The roots of Computer Vision for DSS may be traced back to research about vision-based mobile robots in the late sixties and early seventies. Since the computing power which could be made available on a mobile platform was very limited at that time, the initial flurry of interest abated after some years. Only very few groups continued to work in this area until the digitization and processing of video images became more amenable to experimentation at the beginning of the eighties. The ambitious DARPA `Autonomous Land Vehicle' program re-invigorated civilian research efforts towards vision-based autonomous mobility for indoor as well as outdoor vehicles. This in turn stimulated CV-research for vehicle guidance in the PROMETHEUS framework which had been conceived to cover all aspects of road traffic, not only automatic driving -- see, e. g., [Parkes & Franzen 93].
Thorpe ([Thorpe 90]) provides an overview of research at Carnegie Mellon University related to vision-based autonomous road vehicles. This book may simultaneously serve as a representative description of approaches studied during the eighties. A roundtable discussion during the summer of 1990 in Tokyo resulted in a book `Vision-Based Vehicle Guidance' [Masaki 92]. This roundtable discussion gave rise to a series of annual `Intelligent Vehicles Symposia' sponsored by the IEEE Industrial Electronics Society. The proceedings of these symposia, starting in 1992, provide an immediate access path to international research results related to DSS. Very valuable information can also be gained from a special issue on `Network, Control, Communication and Computing Technologies for Intelligent Transportation Systems' edited by [Amin et al. 95].
A comprehensive, very well balanced assessment of PROMETHEUS efforts has been published by [Braess & Reichart 95]. An in-depth, two volume treatment of contributions by German AI research groups to PROMETHEUS, including CV in the context of driver support systems, can be found in the final PRO-ART report edited by [Nagel 95].
CV will constitute a major, but clearly not the only type of sensory channel required to offer a driver comprehensive support for his activities. In order to indicate more precisely the role which CV might play in this context, we subdivide the overall capabilities of a DSS into four sectors:
The DSS is considered to be an agent, i. e. an encapsulated digital process endowed not only with its own internal state space and control, but in addition with an explicitly statable goal, externally visible actions which enable the agent to sense the state of its environment as well as to communicate with and to influence its environment, and a planning capability which concatenates actions into a plan in order to achieve the goal of the agent. Such an `agent concept' provides a computational model which makes it possible to talk about the DSS in a precise manner: it is assumed that a DSS is permanently alive once the vehicle electronics has been switched on, the DSS may at any time initiate activities -- for example communicate with the driver -- and it may spawn subagents to which it may delegate clearly circumscribed subtasks as their (sub)goals.
The multi-faceted potential contributions of CV to a DSS can be best discussed by first collecting basic driving maneuvers which a driver has to perform in order to guide a vehicle safely through road traffic. A set of generic driving maneuvers will be defined for this purpose. Generic in this context means that parameters of the following nine maneuvers have to be determined on the one hand by the current subgoal which should be achieved, and on the other hand by the currently prevailing boundary conditions, in particular by the current traffic situation around the DSS vehicle:
The execution of a driving maneuver implies the exertion of both lateral and longitudinal control. Both types of control require sensory input about the actually prevailing relations between the vehicle and its environment. To provide such input constitutes the principal contribution of CV to a DSS. More specifically, the following pieces of information have to be obtained by CV:
It is by now established practice to place between 3 and 10 small windows on the expected image plane locations of bright lines marking the left and right delimitation of a lane. Edge elements within each window are associated with the projection of the model lane border into the image plane (sometimes called a `model segment'). (Weighted) deviations between edge elements and associated model segments are usually fed into an Extended Kalman Filter (EKF) in order to update the parameters characterizing the current camera position and orientation -- and thus, for a camera fixed to the vehicle, the vehicle coordinate system -- with respect to the lane.
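The measurement update of such a filter can be sketched as follows. This is a deliberately reduced illustration and not the formulation used by any particular group: it works in road-plane coordinates instead of the image plane, the state consists only of a lateral offset `y0` and a heading angle `psi` relative to a straight lane border, the state is treated as constant between measurements, and all names and numerical values are hypothetical. Each window contributes one scalar measurement, the lateral position of the lane border at a known look-ahead distance.

```python
import math


def ekf_update(state, P, x_ahead, z_meas, r_var):
    """One scalar EKF measurement update.

    state : [y0, psi], lateral offset (m) and heading angle (rad)
    P     : 2x2 state covariance as a list of lists
    Measurement model: z = y0 + tan(psi) * x_ahead
    """
    y0, psi = state
    # Predicted measurement and Jacobian of the measurement model.
    z_pred = y0 + math.tan(psi) * x_ahead
    H = [1.0, x_ahead / math.cos(psi) ** 2]
    # P * H^T and scalar innovation variance S = H P H^T + R.
    PHt = [P[0][0] * H[0] + P[0][1] * H[1],
           P[1][0] * H[0] + P[1][1] * H[1]]
    S = H[0] * PHt[0] + H[1] * PHt[1] + r_var
    K = [PHt[0] / S, PHt[1] / S]            # Kalman gain
    innovation = z_meas - z_pred
    new_state = [y0 + K[0] * innovation, psi + K[1] * innovation]
    # Covariance update P_new = (I - K H) P.
    HP = [H[0] * P[0][0] + H[1] * P[1][0],
          H[0] * P[0][1] + H[1] * P[1][1]]
    new_P = [[P[i][j] - K[i] * HP[j] for j in range(2)] for i in range(2)]
    return new_state, new_P


def track_lane(measurements, r_var=0.01):
    """Process one frame: a list of (look-ahead distance, lateral position)."""
    state = [0.0, 0.0]                      # initial guess: centred, aligned
    P = [[1.0, 0.0], [0.0, 0.1]]            # assumed initial uncertainty
    for x_ahead, z_meas in measurements:
        state, P = ekf_update(state, P, x_ahead, z_meas, r_var)
    return state, P


if __name__ == "__main__":
    # Eight windows between 5 m and 40 m ahead, noise-free synthetic data.
    true_y0, true_psi = 0.3, 0.05
    frame = [(x, true_y0 + math.tan(true_psi) * x) for x in range(5, 45, 5)]
    estimate, covariance = track_lane(frame)
    print(estimate)
```

In a full system the state would also contain lane curvature and camera pitch, and a prediction step based on vehicle motion would precede each frame's updates.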
Automatic lateral control of road vehicles on highways based on such a CV approach is considered state of the art, both for continuous and for discontinuous lane markings. Essentially similar approaches have been used in order to detect lines delimiting adjacent highway lanes on either side of the one used by the camera-carrying vehicle itself. Until recently, special purpose computers, configurations of Digital Signal Processors, or a network of standard processors have been required in order to achieve this in real-time. A standard, 1996 vintage VLSI CPU can roughly cope with the computations required for real-time tracking of a well-marked, reasonably illuminated highway lane.
Innercity roads frequently turn much more sharply than highways or rural roads: clothoid or extended parabolic arc models used for the latter cannot be easily adapted to the more complicated conditions of innercity roads and intersections, in particular since significant parts of the road border may be occluded by, e. g., parked vehicles. As soon as more computing power can be made available within vehicles, model-based approaches are likely to be investigated for innercity lane detection and tracking. Similarly, the lane width is not yet routinely estimated from recorded video images as a variable parameter of a generic lane model. Roads are not restricted to planes, although for highways and larger roads outside cities and mountainous areas the vertical curvature is so small that it has usually been neglected.
Whether additional computing power will be used in order to refine road models as discussed above or rather to increase robustness of lane detection and tracking under more adverse operating conditions such as driving by night, during fog, heavy rain, or snow, remains to be seen.
Relatively few experiments have been reported so far which have been specifically directed at the detection and handling of intersections by CV. An intersection or junction is usually specified as either a gap in the lateral lane or road marking or as a stop line across the approach lane. Since most driving maneuvers so far have been performed on highway-like roads, the most frequent maneuver upon encountering an intersection or junction area consisted in driving straight across. In these cases, a gap in the lateral lane markings is simply considered as a failure to pick them up correctly: it is handled by straight extrapolation of previously estimated lane parameters.
If, however, a sharp turning maneuver had to be performed at an intersection or junction, a road model of the specific intersection/junction has been used. Complications arise if the visible section of lane boundary markings becomes difficult to detect or if lane boundaries temporarily disappear from the field of view of a camera oriented straight ahead, i. e. along the tangent of the current vehicle trajectory. In such cases, the camera has to be mounted on a panning head or more than one single camera has to be used. Very few systematic experiments have been reported so far regarding such conditions.
Additional complications may arise if the intersection or junction area becomes so complicated that the section essential for lateral control can no longer be captured entirely by a monocular video stream. Apart from relating a virtual vehicle trajectory to two or more images at the same time, the initialization and tracking of lane boundary markings in the different image planes may become difficult due to combinatorial problems during the association step between edge elements and model segments.
In addition to continuous or interrupted bright lines indicating lateral lane boundaries, other bright marks are painted occasionally onto a lane, for example arrows indicating obligatory driving directions (straight ahead, compulsory right turn, etc.), signs indicating that stopping or parking is prohibited, special lanes for buses or bicyclists, etc. Although such symbols are standardized, it has been observed that the actually painted road mark frequently does not (fully) conform to the standard. Recognition of such symbols within a DSS thus constitutes a problem which has rarely been addressed by CV in vehicles.
A multitude of shapes will have to be handled although it should be possible to adapt methods from workpiece recognition on conveyor belts for these recognition tasks. The problem is aggravated in heavy traffic if parts of a road marking are occluded by other vehicles such that a virtual image of a marking has to be built by concatenating successively visible parts.
Practically nothing has been published about the detection and tracking of lane marks constituted by rows of reflectors nor have difficulties been treated which arise when -- possibly only temporarily -- irrelevant lane markings have not been completely removed or are superseded by new ones in a different color (for example yellow instead of white).
Traffic signs and traffic lights mounted on posts close to a lane will appear in a roughly known region of an image recorded by a vehicle-mounted, forward-looking camera. Since traffic lights and signs appear in shapes and colors known a priori, color cues are exploited within such search regions to quickly collect subregions which should be tested for compatibility with the hypothesis of representing a traffic sign or traffic light. Approaches which rely only on shape cues in order to detect and classify traffic signs seem to be less reliable.
It appears as if the `quick-and-dirty' approaches towards traffic sign detection and recognition have been exhausted. The challenge to gradually decrease the failure and false alarm rates as well as to increase the correct recognition rate even under adverse conditions is likely to be taken up by specialized research groups within companies. Relevant results thus are less likely to be described in detail in the scientific literature.
The norms regarding shape and color coding within traffic signs support a systematic approach towards their detection and classification. In order to obtain reliable detection and classification results, however, significant efforts are required even for isolated traffic signs positioned along highways and rural roads. Raw color data have to be carefully postprocessed in order to limit detrimental influences of illumination or recording conditions. A multi-step hierarchical approach, intermixing color as well as shape evaluation in the image plane, appears to be necessary just in order to classify an image region into one of the many categories of traffic signs. Given a 1996 vintage VLSI CPU, this is possible in real time provided there are not too many distracting similar signs or shapes. When it comes to a reliable deciphering of symbols within a traffic sign, less is known.
Traffic sign recognition on highways is usually attempted with a tele-camera (fixed to the vehicle) in order to facilitate early detection. Not much has been published about repeated evaluation of the same signs in consecutive image frames, nor about combined evaluation of traffic sign images in windows extracted initially from sequences recorded by a tele-camera, with a subsequent switch to image regions from frames recorded by a wide-angle lens camera as soon as the vehicle approaches the traffic sign. Only cursory reports are available regarding the advantages of actually tracking the hypothetical image of a traffic sign by a video-camera mounted on a pan/tilt-head within a vehicle in order to increase the signal-to-noise ratio for a more precise evaluation.
So far, driving tasks and the associated CV approaches have been discussed on the basis of essentially planar scene models. Even traffic sign recognition could be handled by a picture domain approach, given `normal' highway or rural road conditions. Problems become more severe, however, if traffic signs have to be looked for not only at posts planted near the road border, but also on arms or bridges extending across multiple-lane roads. Likewise, the 3-D structure of the environment has to be taken into account if the lane marking is provided by rows of beacons or posts or if the vehicle has to maneuver through narrow underpasses or gates. The latter conditions rarely occur on highways, but rather tight situations may have to be coped with along road construction sites and in the vicinity of traffic accidents.
In innercity road traffic, 3-D analysis cannot be avoided, not least because vehicles and other objects may occlude part of the road limits or traffic lights. Not much has been published so far about such problems.
The common approach towards the detection of obstacles searches for cues which are incompatible with the assumption that a certain area of the image corresponds to a road surface: such cues could be significant gray value transitions, texture boundaries, or -- alternatively -- regions with unexpected gray values or colors. Although these approaches are relatively cheap regarding the necessary computing power, usually their detection rate is low or their false alarm rate is high.
More reliable are approaches which essentially exploit the phenomenon that anything extending from the -- assumed planar -- road will exhibit a disparity which differs from that of points on the road plane itself if image frames recorded from different vantage points are compared. So far, all such approaches assume that the external camera parameters are known. Three types of approaches can be distinguished here, depending on whether the image frames based on which the disparity is estimated are recorded
In case (2) above, the relative calibration of a stereo camera pair is replaced by knowledge about motion of a monocular camera with respect to the road between two recorded image frames. Both approaches (explicit calculation of the expected shear mapping or the determination of a disparity value between corresponding image locations) sketched for the comparison of stereo image frames can then be applied to this case, too.
The approach mentioned above under (3) differs from the disparity variant of approach (2) only insofar as the optical flow is estimated instead of the displacement between corresponding image locations from two frames taken some time apart. Optical flow estimation is still time consuming. If sufficient computing power is available, it has the advantage of being performed by a non-search calculation restricted to a local spatio-temporal (x,y,t)-volume from the recorded monocular gray value stream. Even non-prominent texture or gradual gray value transitions such as those due to illumination gradients can be exploited in this manner, resulting in a more densely populated optical flow field. Feature-based approaches towards the estimation of optical flow usually result in less densely populated optical flow fields, which make it more difficult to reliably segment the image area corresponding to a potential obstacle.
The advantage of all these approaches -- versus those based on the detection of unexpected gray value configurations in the image area associated with the road surface in front of the vehicle -- consists in the fact that knowledge about the geometry of a projective transformation from one view to another can be exploited. Even high contrast marks or shadows on the road surface can thus be easily distinguished from objects extending vertically from the road plane. Special purpose processor arrangements already allow optical flow fields or warping transformations to be computed in real time, although the resolution and reliability still leave ample room for improvement.
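The geometric argument can be made concrete for the stereo case. The sketch below assumes an idealized setup that the text does not specify: two pinhole cameras with a horizontal baseline b, mounted at height h above a planar road, with optical axes parallel to the road. A road point at distance Z then appears v - v0 = f h / Z rows below the horizon row v0 and has disparity d = f b / Z, so that the disparity expected for the road plane is d = b (v - v0) / h, independent of the focal length f. Function names and the tolerance are hypothetical.

```python
def expected_road_disparity(row, horizon_row, baseline_m, camera_height_m):
    """Disparity (pixels) that a point on the planar road would show in `row`.

    Follows from d = f*b/Z and (row - horizon_row) = f*h/Z; the focal
    length cancels. Rows at or above the horizon cannot show the road.
    """
    rows_below_horizon = row - horizon_row
    if rows_below_horizon <= 0:
        return 0.0
    return baseline_m * rows_below_horizon / camera_height_m


def is_obstacle(row, measured_disparity, horizon_row,
                baseline_m, camera_height_m, tolerance_px=1.0):
    """True if the measured disparity exceeds the road-plane expectation.

    A point above the road is closer to the camera than the road point
    imaged in the same row, hence its disparity is larger.
    """
    expected = expected_road_disparity(
        row, horizon_row, baseline_m, camera_height_m)
    return measured_disparity - expected > tolerance_px


if __name__ == "__main__":
    f, b, h, v0 = 800.0, 0.3, 1.2, 288.0     # assumed camera parameters
    Z = 20.0                                 # distance of the scene point
    # A shadow edge lying on the road, 20 m ahead.
    shadow_row = v0 + f * h / Z
    print(is_obstacle(shadow_row, f * b / Z, v0, b, h))    # False
    # A point 0.6 m above the road at the same distance.
    raised_row = v0 + f * (h - 0.6) / Z
    print(is_obstacle(raised_row, f * b / Z, v0, b, h))    # True
```

A shadow or painted mark yields exactly the road-plane disparity and is therefore not flagged, whereas a point raised above the road is. This distinction is not available to purely gray-value-based approaches.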
A vehicle on the same lane, but in front of the DSS constitutes an obstacle, albeit of a special kind if it moves at a limited velocity differential with respect to the DSS. On highways and rural roads, preceding vehicles have to be detected and tracked in order to decide whether they should be followed or overtaken.
Two approaches have been devised which both take advantage of the special type of `obstacle' expected in this case. The first approach assumes that the rear side of the preceding vehicle exhibits a marked symmetry around its central vertical axis in the image and that the silhouette contrasts well against the background along the horizontal direction. A Hough Transform is used with tentatively paired edge elements in order to search for edge element pairings with a lateral distance compatible with the rear view of a vehicle and a center position corresponding approximately to the middle of the lane. Depending on the sophistication of the approach, a simple rectangle might be fitted to image gradients, initialized by the size and position estimates obtained by the Hough search for symmetric pairs of vehicle side edge segments. A further step in sophistication consists in fitting the projection of a simple 3-D vehicle model in the form of a parallelepiped to the image edges, to initialize a Kalman Filter and to track such a model from frame to frame.
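The voting step of the symmetry-based approach can be illustrated with a Hough-style accumulator over candidate positions of the symmetry axis. The sketch below is a simplification under stated assumptions: it considers only the image columns of vertical edge elements in one horizontal band, ignores edge polarity and strength, and omits the lane-center constraint. The function name, the bin width of one pixel, and the example values are hypothetical.

```python
from collections import Counter


def vote_symmetry_axis(edge_columns, min_width, max_width):
    """Vote for the column of a vertical symmetry axis.

    Every pair of edge columns whose lateral distance is compatible with
    the rear view of a vehicle casts one vote for its midpoint. Returns
    (axis_column, votes) for the strongest accumulator cell, or None if
    no admissible pair exists.
    """
    columns = sorted(edge_columns)
    accumulator = Counter()
    for i, left in enumerate(columns):
        for right in columns[i + 1:]:
            width = right - left
            if min_width <= width <= max_width:
                accumulator[(left + right) // 2] += 1
    if not accumulator:
        return None
    return accumulator.most_common(1)[0]


if __name__ == "__main__":
    # Outer contour (100, 160), tail lights (110, 150) and licence plate
    # (120, 140) of a vehicle centred at column 130, plus clutter (30, 200).
    edges = [30, 100, 110, 120, 140, 150, 160, 200]
    print(vote_symmetry_axis(edges, min_width=15, max_width=65))
```

In this example the three symmetric pairs all vote for column 130, whereas pairs involving clutter scatter their votes over other cells. The winning cell and the widths of the pairs that supported it could then initialize the rectangle or box model mentioned above.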
A different detection approach exploits the observation that the rear side of a road vehicle casts a shadow on the road which very frequently can be clearly detected as a fairly sharp gray value transition to very dark values (corresponding to the visible road area underneath the preceding vehicle). As soon as an approximately horizontal edge segment of appropriate length and location can be detected in the image area corresponding to the road in front of the DSS, it is hypothesized to constitute the rear shadow edge of a preceding vehicle. One may then search for vertical edge segments corresponding to the left and right sides of the preceding vehicle and use these three cues in order to initialize either a 2-D rectangle or a 3-D box model for tracking.
Both of the approaches mentioned can be implemented to work in real-time on a 1996 vintage VLSI CPU. Whether more complex model-based tracking and recognition procedures are initialized depends on what information about the preceding vehicle will be required for the driving maneuver to be supported by the DSS. In any case, if the hypothesis that a preceding vehicle has been detected in the lane of the DSS has been firmly established, the distance to this preceding vehicle can be easily inferred from its rear end projected onto the road plane, exploiting the camera calibration of the DSS. Given a reliable estimate of this distance to the preceding vehicle, longitudinal control of the DSS vehicle can be based on this distance estimate (Follow_Preceding_Vehicle). Alternatively, a warning to the driver can be generated if this distance drops below a safety threshold.
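A minimal sketch of this distance estimate follows, assuming a pinhole camera at a known height above a planar road with its optical axis parallel to the road, so that a road point seen r rows below the horizon row lies at distance Z = f h / r. The safety threshold is modelled here as a time headway of two seconds, which is an illustrative assumption and not a value taken from the text. All names are hypothetical.

```python
def distance_on_road_plane(contact_row, horizon_row, focal_px, camera_height_m):
    """Distance (m) to the road point imaged in `contact_row`.

    `contact_row` is the image row of the lower rear edge of the
    preceding vehicle, i.e. where it touches the road plane.
    """
    rows_below_horizon = contact_row - horizon_row
    if rows_below_horizon <= 0:
        raise ValueError("contact point must lie below the horizon row")
    return focal_px * camera_height_m / rows_below_horizon


def needs_warning(distance_m, own_speed_mps, headway_s=2.0):
    """True if the distance is shorter than the assumed safety headway."""
    return distance_m < own_speed_mps * headway_s


if __name__ == "__main__":
    distance = distance_on_road_plane(
        contact_row=336, horizon_row=288, focal_px=800.0, camera_height_m=1.2)
    print(distance)                          # about 20 m
    print(needs_warning(distance, 30.0))     # 30 m/s: too close
    print(needs_warning(distance, 5.0))      # 5 m/s: acceptable
```

Because the distance varies with 1/r, a localization error of one row matters little for nearby vehicles but causes large errors close to the horizon, which is one reason why a firmly established vehicle hypothesis is required first.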
The principal road signs on highways and major roads are normed. Their detection and initial treatment essentially resembles that of traffic signs. The problem arises once the written content has to be evaluated. Considerable variations in the size of characters -- even within the same road sign -- as well as the mixture of symbols and text make it appear advisable to develop subagents responsible for the detection, tracking, and interpretation of road signs.
Experience with address label reading machines suggests that a mere capability to read single characters will not suffice in general. A dictionary of names which may appear on road signs as well as a data base of relations -- capturing likely clusters of related names as well as rankings indicating which names are likely to appear together in a certain area on a particular road -- may turn out to be required. Given the fact that a lot of related information has already been stored on CD-ROMs in vehicle navigation systems, it appears reasonable to exploit this information in order to increase the reliability of road sign interpretation.
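How a dictionary can repair an imperfect character-level reading may be illustrated with approximate string matching from the Python standard library. This is only a sketch of the idea: the function name, the list of expected names, and the similarity cutoff are hypothetical, and a real system would in addition weight candidates by the relations and rankings mentioned above.

```python
import difflib


def correct_place_name(raw_reading, expected_names, cutoff=0.6):
    """Map a possibly misread name to the most similar expected name.

    `expected_names` stands for the names which the navigation data base
    considers plausible in the current area. Returns the best candidate
    in upper case, or None if nothing is similar enough.
    """
    candidates = [name.upper() for name in expected_names]
    matches = difflib.get_close_matches(
        raw_reading.upper(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    nearby = ["Karlsruhe", "Kassel", "Koblenz"]
    print(correct_place_name("Karlsrune", nearby))   # one misread character
    print(correct_place_name("XQZW", nearby))        # no plausible match
```

Restricting the candidate list to names expected near the current position keeps the search small and reduces the chance of confusing similar names from distant regions.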
One might even think about timely advice to the driver to slow down in order to secure the reliable scanning of road signs prior to critical junctions. Even if a navigation system may be advertised to provide independence from road signs, continuously cross checking the data base of the navigation system against road signs might decrease the risk of missing recent changes in the road network.
This essay deliberately delimited the discussion since otherwise no reasonable synthesis of the abundant material appeared feasible within the available space. Restriction to CV makes it possible to exclude, for example, microwave radar, ultrasound, and range sensors. Analogously, subjects such as off-road driving, mobile indoor robots, vision-based guidance of vehicles on railways, of aircraft or ships have not been taken into consideration. Restriction to CV for a DSS also made it possible to exclude the treatment of road surveillance, for example in the context of traffic management systems. This implied that a DSS is considered to be an essentially vehicle-autarchic agent.
An attempt has been made to explicate the assumptions which underlie certain categories of approaches: such assumptions are reformulated as components of a model required for a model-based CV approach. Models for bodies formalize expectations of the designer of a CV system about what should be searched for, segmented, and tracked in an image sequence. Similarly, motion models formalize expectations about systematic changes of relative pose between a body and the recording camera system -- or the vehicle carrying the camera(s).
Explicating assumptions underlying a CV system by reformulating them as models of bodies and their properties, of spatiotemporal relations between bodies and of their change with time, prepares for the next abstraction step: association of geometric results with conceptual descriptions. Explicit, system-internal models for bodies and their movements facilitate the association of spatiotemporal gray value variations with states and state-changes of traffic participants recorded by a video stream. This, in turn, requires introducing the concept of a scene-agent -- a movable body with additional attributes, observable by CV in the recorded scene -- which has to be clearly distinguished from the concept of a system-internal agent used to conceive and realize a perspicuous internal structure of a DSS. Scene-agents exhibit visible behavior because they concatenate elementary, movement-related activities in a goal-oriented manner. This implies that the DSS even introduces models of ways in which different types of movements are concatenated: models for different types of motion, their duration and change encode expectations about the behavior of scene-agents.
The abstraction step from geometric results to conceptual descriptions also necessitates taking into account uncertainty due to sensor and transducer noise, artifacts of the image evaluation process (caused by, e.g., oversimplified assumptions or numerical inaccuracies), as well as due to inherent vagueness of concepts used for the automatically generated description.
The discussion of the preceding Section implicitly relied on a scenario which differentiated vehicle guidance into
An algorithm which has been implemented on the basis of explicated models may be somewhat slower in comparison with one which only implicitly encodes the ideas underlying its design. Explicating the assumptions underlying an approach, however, serves an additional purpose: such an algorithm becomes amenable to a systematic analysis, an important advantage in the longer run as the following consideration illustrates.
A semi-logarithmic plot of some measure of computing power provided by a VLSI CPU versus its first year of availability exhibits a surprisingly linear relationship over two decades, indicating an exponential increase in computing power at a rate of close to 50 % per year. This adds up to an increase by roughly an order of magnitude every five to six years, at about constant cost, power, and space consumption. If an algorithm has been designed or hand-tuned for a particular CPU or some special purpose processor, the design decisions on which such efforts have been based must be re-evaluated about every two to three years in order to exploit the increase in computing power accruing from technological innovation. For an algorithm amenable to rational analysis, this effort can be considerably smaller, thus providing a substantial competitive edge in the longer run even if such an algorithm might be somewhat slower initially. This argument becomes relevant as soon as the initial performance begins to reach threshold requirements for real-time experimentation, as has been the case since the mid-nineties: the computing power of a modern VLSI CPU begins to become comparable with that required for real-time elementary treatment of B/W interlaced video signals (576 lines by 768 pixels, each at 8 bit gray values). `Elementary treatment' implies the detection of local gray value transitions and initial selection steps such as non-maximum suppression in, say, a 3×3 or 5×5 pixel environment. This premise formed the basis for the argument to forego a discussion of -- in general oversimplified -- approaches such as thresholding a gray value picture, followed by carefully tuned processing of binary images: it is well known that such approaches are usually very brittle.
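The orders of magnitude involved can be checked with a short calculation. The sketch below compounds the growth rate quoted above and computes the raw data rate of the video signal. The frame rate of 25 frames per second is an assumption (it is customary for video with 576 visible lines, but is not stated in the text), and the function names are hypothetical.

```python
import math


def growth_factor(rate_per_year, years):
    """Factor by which computing power grows at a constant yearly rate."""
    return (1.0 + rate_per_year) ** years


def years_to_factor(rate_per_year, factor):
    """Years needed to reach a given factor at a constant yearly rate."""
    return math.log(factor) / math.log(1.0 + rate_per_year)


def video_bytes_per_second(lines, pixels_per_line, bits_per_pixel,
                           frames_per_second):
    """Raw data rate of an uncompressed gray value video stream."""
    return lines * pixels_per_line * bits_per_pixel * frames_per_second // 8


if __name__ == "__main__":
    print(growth_factor(0.5, 5))            # growth over five years
    print(years_to_factor(0.5, 10.0))       # years per order of magnitude
    # Assumed frame rate: 25 full frames per second.
    print(video_bytes_per_second(576, 768, 8, 25))
```

A rate of 50 % per year compounds to a factor of about 7.6 in five years and reaches a factor of ten after somewhat less than six years. At the assumed frame rate, the video stream delivers about 11 million gray values per second, so that even a few dozen operations per pixel for gradient estimation and non-maximum suppression amount to several hundred million operations per second.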
In summary, Computer Vision has proven that it provides a reliable basis for solution approaches to be incorporated in Driver Support Systems. Technology has advanced to the point where the emphasis in algorithmic development begins to shift from ad-hoc approaches -- essentially enforced previously by insufficient computing power available onboard a road vehicle, thus presenting an invitation to cut all feasible (and infeasible) corners -- towards analyzable, well engineered approaches. At the moment, it appears too early to convert such approaches into products for the mass market. This may change soon, however, as an extrapolation of the computing power becoming available within the foreseeable future suggests. In view of the complexity of the overall task, experience is required to properly exploit this computing power. It appears to be high time to prepare if one intends to enter the market. This essay attempts to convince the reader that the scientific basis for such an endeavor has come within grasping distance.