TV Stream Table of Content: A New Level in the Hierarchical Video Representation

Despite the rapid development of new technologies, TV has kept its position as one of the most important entertainment, and sometimes educational, utilities in our daily life. Keeping this position, however, required major changes so that TV could keep pace with the digital revolution: digital broadcasting, High Definition TV, TV on demand, TV-REPLAY, WebTV, etc. This evolution, together with other factors such as the wide spread of communication means and the low price of storage media, has made technologies for storing, structuring, searching and retrieving video content indispensable. Video content can be viewed at various levels: a sequence of frames, a sequence of shots, a sequence of scenes, or a sequence of programs, which is what a TV stream is usually composed of. Structuring video content is of great benefit for indexing, searching and retrieving information from the content efficiently. For example, structuring a soccer game into Play/Break phases later facilitates the detection of goals or the summarization of the soccer video. Another example is structuring a news program into stories, where each story is composed of an anchorperson segment followed by a report, which later facilitates the search for a specific story or intelligent navigation inside the news program. However, existing analysis methods are each dedicated to one type of video content. Such methods generate very poor results if they are applied to a TV stream that is composed of several video programs. So, it is important to detect a priori the boundaries of each program and then identify the type of each program in order to run the dedicated analysis method based on that type. For a TV viewer, a TV stream is a sequence of programs (P) and breaks (B). Programs may be separated by breaks and may also include breaks.
For analysis purposes, the stream can be considered as a sequence of audio and video frames with no markers of the start and end points of the included programs or breaks. Most TV channels that produce TV streams provide a program guide describing the broadcast programs. However, such guides usually lack precision, especially in the presence of live programs, whose start and end times are very hard to predict. Moreover, program guides do not include any information about the breaks (e.g. commercials). Hence, one of the important steps in structuring TV video content is to segment it into different programs and then choose the appropriate method to segment each program separately based on its type. TV stream structuring consists in detecting the start and end of all the programs and breaks in the stream and then automatically annotating each program with metadata that summarizes its content or identifies its type. This step can be performed by analyzing the metadata provided with the stream (EPG or EIT), or by analyzing the audio-visual stream itself. In this article, we define what we call the TvToC (TV stream table of content), which adds a new level to the hierarchical video decomposition (the traditional video ToC). Then, we provide a comparative study of the methods and techniques in the domain of TV stream segmentation, highlighting the advantages and the weaknesses of each approach.


Introduction
With the rapid development of digital capturing, storage and communication devices, the capture, production and sharing of multimedia content has become very easy and very common. With a simple click on a mobile phone, or on a computer keyboard using recording and video production software, one can produce, share or even broadcast TV easily. Moreover, social media networks have further facilitated the spread of multimedia content, e.g. sharing a video with thousands of people or watching a TV stream on a computer or a smartphone. However, to get the most benefit from this huge number of stored video streams, they need to be easily accessed, retrieved and browsed, which is still an open problem.
The traditional way to access video content is to use fast rewind and fast forward at different speeds in order to navigate manually to the part of the video that interests the user. Such navigation is usually considered inefficient: it is time-consuming, especially when the video is long, and it bears no relation to the video content. That is why providing intelligent video content access methods is of great interest. For example, a story in a news program can be skipped with a simple click on a remote control if it does not interest the viewer. The structure of the video is the key to such intelligent access. A lot of existing work in the literature has addressed video content structuring.
An important method proposed to access video content is inspired by the access methods of textual books [1]. In a book, the table of content (ToC) is one of the efficient mechanisms to access the content without reading the whole document. The ToC helps the reader to find the chapters or sections of interest and to navigate directly to the relevant part of the document. Moreover, a document contains an index that lists the keywords considered relevant to the readers together with their locations in the document. Such an index can be used to answer the queries of users. So, the ToC helps readers to navigate within the document intelligently, while the index helps them to retrieve information from it. The ToC also gives a summary at the beginning of the document that provides an overview of the entire content.
Video content structuring methods have mapped the idea of the ToC to video content. With the help of a video ToC, we can browse and retrieve information much more easily. However, constructing such a ToC for video content raises several challenges. Contrary to books, videos do not always have an apparent and common structure. Some of them are well structured, such as news programs (an introduction, a presentation of a topic, a report, a presentation of the next topic, a report and so on) or tennis games (points, games, sets), while others are very difficult to structure, such as soccer games (hardly structured beyond play/break phases). In addition, each type of video content needs its own structuring method, e.g. a method that structures a news program cannot structure a movie. Consequently, a TV stream, which normally contains several programs (several video segments belonging to different programs) of different types and natures, should first be separated into programs; then, the type of each program should be identified in order to run the relevant structuring method. As a result, additional information related to the boundaries of each program needs to be included in the ToC when it is aligned with the video content. The process of detecting the boundaries of programs in a TV stream, with the objective of segmenting it into separate programs, is nowadays called TV stream macro-segmentation. This name was given to differentiate the process of detecting program boundaries from the usual segmentation that is done within a single program to split it into smaller parts (scenes, shots, etc.).
Before presenting the TV stream structuring methods, we may ask ourselves an important question: why do we need to structure TV streams if we know that TV channels produce the streams before broadcasting them, and thus should have precise metadata about the broadcast streams (start, end and description of programs)? In practice, broadcast TV streams have no metadata except the electronic program guide (EPG) or the event information table (EIT), which lack precision, especially for live programs whose start and/or end times cannot be predicted a priori. Moreover, TV channels do not provide precise data about their content, in order to prevent third parties from archiving it and building novel TV services (TVoD, Catch-up TV …) without going back to the channels. We should also not forget that the stream production process is very complex and involves many people, which makes the preparation of metadata a non-trivial task. Furthermore, delivering precise metadata to viewers would give them the possibility to skip commercials (in recorded streams or catch-up TV services), which are the primary financial source of TV channels [2].
The aim of this article is to present an overview of the TV stream structuring methods in the literature and to discuss the approaches and the results obtained. The article is organized as follows: Section 2 defines what we call the TV stream table of content (TvToC). Section 3 presents the state of the art of the existing methods for TV stream segmentation. The datasets used, the evaluation measures calculated and the results obtained by each approach are provided in Section 4, in addition to a discussion of the efficiency of each of them. We conclude the article in Section 5.

TV Stream ToC
Before TV streams were addressed, video content segmentation or structuring methods considered video content of a single type, except for a few that handled content containing commercials. In [3], video content structuring was defined as the task of decomposing the video into units and constructing the relationships between them. In text documents, we find chapters, paragraphs, sentences and words. Similarly, in a video, we find the video itself, groups of scenes or stories, scenes, shots, sub-shots and key-frames. Others consider video content segmentation as a classification problem in which shots are clustered into groups in order to obtain video scenes, which are in turn clustered in order to obtain stories, and so on.
The six-level video units are defined as follows:
1. Video: Flow of video and audio frames presented at a fixed rate.
2. Story or Group of scenes: Several scenes that capture continuous action or series of events. This element is relevant for some video genres such as news reports and movies.
3. Scene: A series of shots that is semantically related and temporally adjacent. It is usually composed of a series of shots recorded in the same location.
4. Shot: A sequence of frames that are recorded continuously with the same camera.
5. Sub-shot or micro-shot: A segment in a shot that corresponds to the same camera motion. Each shot may be composed of one or more consecutive sub-shots depending on the camera motion.
6. Key-frame: The frame that represents a shot or a sub-shot. Each shot and sub-shot may be represented by one or more key-frames.
In Figure 1, we present the six-level hierarchy. Each unit in a level can be produced by aggregating several units of the lower level (clustering-based techniques) or by segmenting units of the upper level (segmentation-based techniques). For example, a scene can be identified by aggregating several shots or by segmenting a story. The literature is very rich in techniques that address one or several levels of this hierarchy (segmentation-based or classification-based approaches). The reader may refer to [3][4][5][6][7][8][9] for more information about the segmentation-based techniques and to [10,11] for a review of the classification-based ones. Unfortunately, the six-level hierarchy cannot be constructed for all types of videos, since some of them do not have a clear structure. In the literature, we can identify two main types of videos: structured videos (news, tennis games …) and unstructured or semi-structured videos (e.g. soccer games, video surveillance …). A structured video is one that is produced according to a script or plan and can be edited later [7]. Unstructured and semi-structured content, instead of being decomposed into the six-level hierarchy, is decomposed into logical units. For example, we cannot decompose a soccer game into scenes and stories. Instead, most of the techniques in the literature decompose a soccer game into Play/Break sequences. The Play unit represents the sequence of shots in which the ball is inside the field and the game is going on, while the Break unit represents the case when the ball is outside the field (see [12,13,14] for more information). For video surveillance, the units are less clear. Techniques in the literature consider the periods of activity as Play and the periods of inactivity as Break (see [15] for more information about video surveillance).
For a TV stream, the six-level hierarchy is not sufficient. The viewer of a single video may be interested in browsing it by scenes or stories, which is not the case for a TV stream viewer. A TV stream may be composed of a large number of scenes and stories [16]. It contains several heterogeneous programs, which are usually separated and interrupted by breaks (commercials), and each program has its own ToC. For TV stream browsing and retrieval, it is more practical to add levels to the hierarchy that facilitate navigation by program, each program then having its own ToC that allows us to navigate deeply within it. Figure 2 shows the levels that may be added to the hierarchy. The new ToC is named TvToC. In such a hierarchy, users may skip the programs that do not interest them and go deeper into others. A level that links programs of the same type is also inserted. This level may be built by categorizing the programs of the TV streams (e.g. [10,20]).
The new units are defined as follows:
- TV stream: A contiguous sequence of video and audio frames produced by TV channels. It is composed of a series of heterogeneous programs (P) and breaks (B), without markers at the signal level of the boundaries of the programs and the breaks. Two consecutive programs are usually, but not always, separated by breaks. Each program may also be interrupted by breaks.
- Break (B): Every sequence with a commercial aim, such as commercials, interludes, trailers, jingles, bumpers and self-promotions. In some references [2,17], breaks are also called inter-programs or non-programs.
- Program (P): Every sequence that is not of break type (movies, TV games, weather forecasts, news …). Programs have a cultural, informative or entertainment aim. Sometimes, a program may be composed of several parts separated by break sequences.
- TV stream structuring: Also known as TV stream macro-segmentation, it is the process of precisely detecting the first and the last frames of all the programs and breaks of the stream and of annotating all these segments with metadata. As a consequence, TV stream structuring allows the user to recover the original programs that make up the continuous stream.
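As an illustration only, the units defined above could be held in a small nested data structure. The sketch below is a minimal Python model under our own assumptions: the names `Segment`, `TvToC`, `program_id` and `genre` are hypothetical and do not come from any of the surveyed works, and frame indices stand in for the detected boundaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Segment:
    """One contiguous part of the stream: a program part (P) or a break (B)."""
    kind: str                          # "P" or "B"
    first_frame: int                   # index of the first frame (inclusive)
    last_frame: int                    # index of the last frame (inclusive)
    program_id: Optional[str] = None   # hypothetical identifier; None for breaks
    genre: Optional[str] = None        # hypothetical type label, e.g. "news"


@dataclass
class TvToC:
    """Top of the hierarchy: the segmented stream, in broadcast order."""
    segments: List[Segment] = field(default_factory=list)

    def programs(self) -> Dict[str, List[Segment]]:
        """Group program parts so a program interrupted by breaks is one entry."""
        grouped: Dict[str, List[Segment]] = {}
        for seg in self.segments:
            if seg.kind == "P" and seg.program_id is not None:
                grouped.setdefault(seg.program_id, []).append(seg)
        return grouped

    def by_genre(self) -> Dict[str, List[str]]:
        """Level that links programs of the same type."""
        genres: Dict[str, List[str]] = {}
        for seg in self.segments:
            if seg.kind == "P" and seg.genre and seg.program_id:
                ids = genres.setdefault(seg.genre, [])
                if seg.program_id not in ids:
                    ids.append(seg.program_id)
        return genres


# Made-up stream: a news program, then a movie interrupted by one break.
example_toc = TvToC([
    Segment("P", 0, 999, "news-1", "news"),
    Segment("B", 1000, 1499),
    Segment("P", 1500, 2999, "movie-1", "movie"),
    Segment("B", 3000, 3299),
    Segment("P", 3300, 4999, "movie-1", "movie"),
])
```

Each program entry could then carry its own traditional six-level ToC; that part is left out of the sketch.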

State of the Art
Most of the structuring methods proposed in the literature focus on structuring a single program and do not handle streams containing several heterogeneous programs. In our review, we focus on two complementary tasks: the first is how the stream is segmented into Program/Break sequences; the second is how, if at all, the segmented programs are labeled with metadata and what the source of these metadata is.
In order to segment TV streams, several types of approaches have been proposed in the literature:
1. The first type of approaches segments the stream into logical units and then classifies each unit as being part of a program or part of a break, as proposed by [2,18]. The logical units to be classified may be of different granularities (key-frames, shots, scenes, stories …). After the classification step, consecutive units of the same type are merged together.
2. The second type of approaches focuses on the detection of discontinuities in the homogeneity of some features [19], the modeling of the boundaries between programs and breaks [21], or the detection of the repetition of opening and closing credits [16].
3. The third type of approaches is based on the fact that breaks have a repetitive behavior. Some of the techniques recognize breaks using a reference database [17] or by searching for repeated logical units [2,18,23]. Since some programs may also have repeated parts, such as the opening and closing credits of news programs, the latter techniques should be followed by a classification step in order to separate repeated program segments from repeated break ones.
After the stream is segmented, the labeling of programs with metadata is done using:
1. The metadata provided by the TV channels, such as the EPG or EIT (e.g. [2,17,18]).
2. The metadata extracted from the signal itself, such as speech transcripts (e.g. [24]), teletext or the recognition of the opening and closing credits of some specified programs.
In this article, the techniques of the literature are categorized into two main categories:
1. The first category contains methods based only on the analysis of the metadata available with the stream. They will be denoted metadata-based. The only method found in the literature is the one proposed by Poli et al. in [25]. In this method, the audiovisual stream may be partially processed to enhance the prediction.
2. The second category represents methods based on the analysis of the audiovisual stream. They will be denoted content-based and can be categorized into two sub-classes:
- The class of methods that search for the boundaries of the programs themselves, denoted program-based methods [16,19,21,26,27].
- The class of methods that detect the breaks that may separate consecutive programs, denoted break-based methods [2,18,28].
In the following sections, we present the different methods of the literature. Then, we provide a summary of their advantages and disadvantages. Finally, we conclude with the results obtained by each method and discuss its efficiency.
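The merging step of the first type of approaches (classify each logical unit, then merge consecutive units of the same type) can be sketched as follows. This is only a minimal illustration, assuming that the units have already been classified and are given in temporal order as hypothetical `(start, end, label)` tuples; it is not the implementation of [2] or [18].

```python
from typing import List, Tuple

# (start_sec, end_sec, label), where label is "P" (program) or "B" (break)
Unit = Tuple[float, float, str]


def merge_units(units: List[Unit]) -> List[Unit]:
    """Merge consecutive units that carry the same label into one segment."""
    merged: List[Unit] = []
    for start, end, label in units:
        if merged and merged[-1][2] == label:
            # Same type as the previous segment: extend its end time.
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, end, label)
        else:
            merged.append((start, end, label))
    return merged


# Made-up classified shots, in seconds from the start of the stream.
shots = [(0, 40, "P"), (40, 95, "P"), (95, 110, "B"), (110, 140, "B"), (140, 300, "P")]
segments = merge_units(shots)
```

Note that such a merge only yields Program/Break sequences: the parts of a program that is interrupted by a break remain separate segments and still need to be linked.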

Metadata-based Methods
As stated above, this category of methods uses only metadata to segment TV streams. It contains the method proposed by Poli et al. in [25]. The idea is to rely on the fact that TV channels try as much as possible to respect some regularity in their program schedule in order to preserve and increase their audience.
Poli et al. propose an extension of the traditional HMM, named Contextual HMM (CHMM), and use a regression tree to predict the start time, the duration and the genre of programs and breaks during a week. In the CHMM, each node represents the genre of a program, and each transition models the passage from one program genre to another. The genre of a program does not depend on the genre of the preceding one but on the time of day and the day of the week of the broadcast, which is called the context of the broadcast program. That is why Poli et al. propose this contextual extension of the HMM. Based on the context, a regression tree is used to predict the minimum, the maximum and the average duration of the broadcast. They use one year of corrected EPGs to train the model and one week to test the system.
The work of Poli et al. builds on the observation that the stream structure of a given day of the week is very similar to the stream structure of the same day in the previous week. In addition, some parts of a day are very similar to the same parts of the previous day. Moreover, the start time, the duration and the genre of programs are nearly the same from one occurrence to the next. For example, a news program always starts at the same time, has almost the same duration and is not replaced by another program (except in some situations). However, the proposed method has several drawbacks: (1) it requires a huge ground truth dataset to train the model; (2) it relies on the assumption that TV channels have a stable stream structure, which is not always the case; (3) the efficiency of the prediction is 95% using a model learned on a one-year stream, so an additional step is required to improve it.
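To make the notion of context more concrete, the sketch below groups past broadcasts by (day of the week, hour of the day) and derives the minimum, maximum and average duration observed for each context. It is deliberately simplified: a plain lookup table stands in for the regression tree, the CHMM itself is not modeled, and the tuple layout and the sample values are made up for illustration rather than taken from [25].

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Context = Tuple[str, int]            # (day of the week, start hour)
Broadcast = Tuple[str, int, float]   # (day of the week, start hour, duration in minutes)


def duration_stats(history: List[Broadcast]) -> Dict[Context, Tuple[float, float, float]]:
    """Return (min, max, mean) duration for every context seen in the history."""
    by_context: Dict[Context, List[float]] = defaultdict(list)
    for day, hour, minutes in history:
        by_context[(day, hour)].append(minutes)
    return {ctx: (min(d), max(d), mean(d)) for ctx, d in by_context.items()}


# Hypothetical corrected-EPG history: three Monday evening broadcasts, one Tuesday one.
history = [("Mon", 20, 30), ("Mon", 20, 34), ("Mon", 20, 32), ("Tue", 21, 90)]
stats = duration_stats(history)
```

A context that never occurred in the history has no entry in the table, which mirrors the dependence of this family of methods on a large and stable training period.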
Other types of methods have been proposed in the literature for program personalization and recommendation purposes (not for structuring) [29,30,31,32], for the creation of program stream summaries [33], or for TV program indexing [34].

Content-based Methods
In this category, we can distinguish two types of methods: program-based methods, which focus on the detection of program boundaries, and break-based methods, which detect break segments.

Program-based Methods
One of the assumptions on which some techniques of the literature are based is that some programs start and end each day at the same time with the same opening and closing credits. That is why Liang et al. propose in [16] a method that constructs a boundary model to detect shots repeated on different days. The model is then used to segment the stream into programs. Liang et al. test the proposed method on 10 non-continuous TV streams recorded from the CCTV-4 channel from 17h00 to 21h00. Among the 10 streams, 4 are used to train the boundary model and 6 to test it. The results obtained in terms of precision and recall are approximately 100%. However, given the following drawbacks, we think that this method is efficient when applied to a very special case of TV sub-streams but cannot be generalized to any TV stream. First of all, not all programs have opening and closing credits. Secondly, the authors have only considered the most structured part of the day (from 17h00 till 21h00), while the other parts are less structured and probably contain programs without opening and closing credits. It would be interesting to consider the whole day instead of only this part. Thirdly, the model cannot detect the commercials that may interrupt programs, which makes it an incomplete macro-segmentation approach. Finally, the method does not propose any way to update the model in order to take into account possible changes in the TV schedule.
Similarly, the work in [21] relies on the same weak assumption that programs start and end with opening and closing credits. The authors also consider that such opening and closing credits, as well as commercials, contain frames with logos, monochrome backgrounds and big text characters. They call these frames Program Oriented Informative iMages (POIM). The idea of this work is to detect these POIMs. In order to reduce false alarms, the authors use auditory and textual information. An SVM classifier is used to find inter-program transitions and reject all other types of transitions, such as commercials. The method is validated on the TRECVID 2005 corpus. Even though the method shows high efficiency, we should highlight the following. First, inserting frames with logos, a monochrome background and big text is not a standard way to separate programs: in the absence of opening or closing credits or POIMs, or when POIMs are missed, consecutive programs will be merged. Secondly, if any POIMs are detected within a program, the program will be over-segmented. Finally, the approach is validated on the TRECVID 2005 corpus, which does not really consist of TV streams: it contains videos of the same type that fit these specific assumptions.
The approaches proposed in [16] and [21] are supervised: the first creates a supervised model to detect program boundaries, and the second uses an SVM classifier to retain inter-program boundaries. Since supervised models tend to lose precision with time and need updates, and because such methods are based on weak assumptions about TV production rules, El-Khoury et al. proposed an unsupervised method to detect the boundaries between programs. They rely on a stronger assumption, namely that a given program has homogeneous properties [19]. Their idea is to detect the discontinuities of some audiovisual features. Within the same program, these features are homogeneous and can be modeled by a Gaussian law. In the next program, the Gaussian law is different from that of the previous program. In order to detect the change from one Gaussian law to another, the authors use a GLR-BIC (Generalized Likelihood Ratio - Bayesian Information Criterion) audio segmentation method that was designed for speaker diarization [22]. The method first uses visual features in order to detect possible transitions from one program to another. Then, audio features are used, and afterwards the two segmentations are merged together. The authors test their method on a real TV stream composed of 120 hours of French TV recorded continuously during 5 days. The results obtained are promising. The originality of the method is that it is unsupervised and can be used for several types of audiovisual analysis tasks such as speaker diarization, shot detection, program segmentation, etc. Moreover, the assumption used in the work is very strong. However, the method has two main drawbacks: first, short segments, such as breaks and short programs, may not be detected; secondly, over-segmented programs are not merged afterwards.
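The principle of detecting a change between two Gaussian laws can be illustrated with the classical ΔBIC test on a one-dimensional feature. The sketch below is a simplified stand-in, not the GLR-BIC procedure of [19] or [22]: it searches for a single change point only, works on a scalar feature, and uses a penalty weight `lam`, a minimum segment length `min_len` and a variance floor that are our own assumptions.

```python
import math
from statistics import pvariance
from typing import List, Optional

VAR_FLOOR = 1e-9  # assumption: avoids log(0) on constant segments


def _log_var(values: List[float]) -> float:
    """Log of the maximum-likelihood variance of a 1-D Gaussian fit."""
    return math.log(max(pvariance(values), VAR_FLOOR))


def delta_bic(x: List[float], i: int, lam: float = 1.0) -> float:
    """Delta-BIC for splitting x at index i; positive values favor two Gaussians."""
    n = len(x)
    gain = 0.5 * (n * _log_var(x) - i * _log_var(x[:i]) - (n - i) * _log_var(x[i:]))
    # A second 1-D Gaussian adds two parameters (mean and variance).
    penalty = 0.5 * 2 * math.log(n)
    return gain - lam * penalty


def best_change_point(x: List[float], min_len: int = 5, lam: float = 1.0) -> Optional[int]:
    """Index of the most likely change point, or None if no split is justified."""
    best_i, best_score = None, 0.0
    for i in range(min_len, len(x) - min_len + 1):
        score = delta_bic(x, i, lam)
        if score > best_score:
            best_i, best_score = i, score
    return best_i


# Made-up feature track: 20 values around 0, then 20 values around 5.
feature = [0.1, -0.1] * 10 + [5.1, 4.9] * 10
change = best_change_point(feature)
```

The `min_len` constraint gives an intuition for the first drawback mentioned above: a segment shorter than the minimum window cannot be isolated by such a test.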
The homogeneity property of features was also used by Haidar et al. in order to segment audiovisual documents using similarity matrices [26]. The idea of the work is to measure the similarity between documents based on some styles [35]; its main aim is not to segment TV streams. The similarity measure can be applied to detect near-duplicate videos, to measure how similar two videos are, or to detect similar segments between two videos. In their work, a similarity matrix is generated for each feature used, and then all the similarity matrices are merged. As an application, the authors compare a day-long stream with itself (auto-similarity) in order to structure it. The similarity matrix clearly shows the structure of the stream. The method has several advantages: (1) it is independent of the video type, the features used and the duration of the video document; (2) it is generic, since it can be applied to TV stream macro-segmentation, video copy detection, video segmentation and other applications; (3) it is unsupervised and does not need any training step, and the assumption on which it is based is very strong and can hardly change. The main drawback of the approach is that the authors do not provide any method to extract the structure from the auto-similarity matrix, which is not trivial.
More recently, deep learning techniques were used by Hmayda et al. [27] in order to identify TV programs based on features learned by an auto-encoder. The idea is to recognize TV programs by learning their jingles. A training database of visual jingles is constructed for several types of TV programs. Then, the features of the various jingles are learned using a stacked sparse auto-encoder network. A set of 1490 images from four TV program types (News, Meteo, Sport, Documentary) was used in the training phase. The approach is tested on a total of 376 images, and the efficiency of program identification reached 95%. Even though the authors do not address the problem of TV stream segmentation, the approach can be used to classify video frames into program frames or break frames.
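The construction of one similarity matrix per feature followed by a merge can be sketched as below. The choices made here are assumptions for illustration only: cosine similarity as the measure, an element-wise average as the merging rule, and two made-up feature tracks (`color`, `audio`) with one vector per stream segment. The source does not state which measure or merging rule [26] uses.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if one is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_similarity(features: Sequence[Sequence[float]]) -> Matrix:
    """Compare every segment of the stream with every other segment."""
    return [[cosine(a, b) for b in features] for a in features]


def fuse(matrices: Sequence[Matrix]) -> Matrix:
    """Merge per-feature matrices by element-wise averaging (assumed rule)."""
    n = len(matrices[0])
    count = len(matrices)
    return [[sum(m[i][j] for m in matrices) / count for j in range(n)] for i in range(n)]


# Four stream segments described by two hypothetical features.
color = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sim = fuse([self_similarity(color), self_similarity(audio)])
```

In this toy case the fused matrix shows two blocks of high similarity along the diagonal (segments 0-1 and 2-3); turning such blocks into segment boundaries is precisely the step that remains open.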

Break-based Methods
The techniques of the literature showed that detecting the boundaries of programs is a hard task. That is why other techniques focus on detecting the breaks that may separate programs instead. They are based on the fact that most TV channels usually separate consecutive programs by breaks or by special types of audiovisual frames. The problem is that a lot of TV channels also interrupt their programs with breaks. In such cases, break-based techniques will also segment the same program into several parts, and a way to merge them should be proposed.
As stated before, breaks can be of several types: commercials, trailers, station identification, bumpers.
to retain the meaningful repetitions in TV streams. The technique is applied on a 22-day TV stream. The first day is used to train the product quantization codes and the remaining to detect the repetitions. The method outperforms the traditional repetition detection methods but unfortunately it was not extended to segment the TV stream.

Comparison of Approaches: Results and Discussion
In order to be fair in the comparison of the proposed approaches, they should use the same dataset in their experiments and report the same evaluation measures. This was not the case, except for the methods [18] and [17]. The datasets used in the literature vary in size (e.g. from one day in [16] to 22 days in [18]). The evaluation measures also differ (e.g. precision, recall, F-measure, etc.). Even so, some characteristics help us to compare the approaches. We can list the following:
- The type of the approach (metadata-based, program-based, or break-based).
- The size and continuity of the dataset used to train and test the approach. Logically, a several-day dataset is better than a several-hour one. Moreover, the continuity of the stream composing the dataset is an important feature since, from our knowledge of the domain, some parts of the day are more structured than others. For example, the period [18h00-22h00] is more structured than other parts because of the large audience watching TV in this period. Structuring a stream built as the concatenation of several well-structured chunks of days is easier than structuring several continuous days as they are broadcast by the TV channel. For example, a 24-hour dataset composed of the chunk from 00h00 till 24h00 is continuous, while one composed of the concatenation of the chunks [18h00-22h00] of 6 days is not continuous, even if the days are consecutive, since the chunks [22h00-18h00] of the 6 days are missing.
- The completeness of the approach. By completeness, we mean whether the approach handles all the steps of TV stream structuring or only some of them. For example, several approaches in the literature do not annotate the segmented stream at all.
- Whether the approach is learning-based or not. A learning-based approach is one that needs to build a model during its process in order to structure the stream. An important related question is whether the built model degrades over time and needs to be updated.

Datasets, Evaluation Measures, and Results
In this section, we present the different datasets used in the literature to evaluate the structuring approaches, the evaluation measures used and the results obtained.
In Table 1, we list, for each approach, the size of the dataset used for training and testing.
Before comparing the obtained results, it is important to list the different evaluation measures that were adopted by these approaches and how they are calculated.
- Precision, Recall, F-measure: Three variants of these measures are adopted. Some approaches calculate them at the program level, some at the frame level, while others focus on the detection of boundaries. At the program level, the precision is equal to the number of programs correctly found in the stream over the total number of programs found. The recall is equal to the number of programs correctly found over the total number of programs that should be found. The F-measure is equal to: $F = \frac{2 \cdot P \cdot R}{P + R}$, where P is the precision and R is the recall.

Table 1. Size of the datasets used for training and testing.
Papers | Training | Testing
[25] | One year | One week
[16] | 4 TV streams (17h00 to 21h00) | 6 TV streams (17h00 to 21h00)
[21] | 3000 POIM images | 5 TV streams from TRECVID 2005 (15h each)
[19] | One hour to tune GLR-BIC parameters | 5 days
[17] | One day as RVD | 20 days
[2] | One week to train the ILP rules | 7 days
[18] | About 30% of annotated repeated sequences in a three weeks stream to train the classifiers | 21 days

Table 2. Evaluation measures used and results obtained by each approach.
Papers | Measures used | Results: P/B Segmentation | Results: Labeling
[25] | Program-level precision & TA | P=97%, TA=17 sec | P=97%
[16] | Program-level precision and recall & TA | P=95.8%, R=100%, TA=28 sec | No labeling step is used
[21] | Boundary-level precision, recall and F-measure | P=88%, R=91.5%, F=89.2% | No labeling step is used
[19] | ARGOS F-measure | F=90.5% | No labeling step is used
[17] | Frame-level precision of programs and breaks | Fprogram≈99%, Fbreak≈90% | Fprogram>88% and <96%
[2] | TA | No segmentation step is used | TA≈3m35s
[18] | Frame-level F-measure of programs and breaks | Fprogram≈98%, Fbreak≈90% | Fprogram>90% and <96%

Table 2 shows, for each approach in the literature, which measure was used and the results obtained. For some approaches, the corpus was composed of several streams, and the measure was calculated for each of them. In such cases, we have averaged the measures over the whole dataset.
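The program-level precision and recall described in this section, combined with the usual harmonic-mean form of the F-measure, can be computed as in the sketch below. The matching rule is our own assumption, since the surveyed papers differ on this point: a found program counts as correct when both of its boundaries fall within a tolerance `tol` (in seconds) of an unmatched ground-truth program.

```python
from typing import List, Tuple

Program = Tuple[float, float]  # (start_sec, end_sec)


def program_level_scores(found: List[Program], truth: List[Program],
                         tol: float = 10.0) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) at the program level."""
    matched = set()   # indices of ground-truth programs already matched
    correct = 0
    for f_start, f_end in found:
        for k, (t_start, t_end) in enumerate(truth):
            if k not in matched and abs(f_start - t_start) <= tol and abs(f_end - t_end) <= tol:
                matched.add(k)
                correct += 1
                break
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up example: 3 programs to find, 4 programs reported, 2 of them correct.
truth = [(0, 1800), (2000, 5600), (5800, 7200)]
found = [(5, 1795), (2000, 5000), (5805, 7195), (7300, 7400)]
precision, recall, f_measure = program_level_scores(found, truth)
```

With these made-up values, 2 of the 4 reported programs are correct and 2 of the 3 expected programs are recovered. Frame-level and boundary-level variants follow the same pattern with frames or boundaries as the counted items.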

Discussions
Since the approaches do not all use the same datasets and the same evaluation measures, the comparison is not easy. However, we can highlight here some key points that will help readers to form their own opinion about each of the proposed approaches.
To do so, we have categorized the approaches of the literature into four categories:
- Category 1: Contains the approaches that do not aim at TV stream segmentation, or whose assumptions are not evident or are very specific to some TV channels [16,21,26,27,48].
- Category 2: Contains the approaches that rely on an annotated video dataset [17] or need a big annotated dataset to train the model [25].
- Category 3: Contains the unsupervised TV stream segmentation approaches [19,23].
- Category 4: Contains the approaches that replace the pre-annotated video datasets with a stage that learns the model from the raw data [2,18].
One of the major drawbacks of the approaches of the first category is that some of them, such as [26,27,48], do not aim at TV stream structuring. Moreover, the approaches [16,21] rely on non-evident assumptions, such as each program having opening and closing credits that repeat from one day to another, which makes them inapplicable to arbitrary TV streams.
The approaches of the second category rely on a big pre-annotated dataset. However, with time, the dataset becomes outdated and the accuracy of the system starts to decrease. Thus, the dataset should be continuously updated to allow the system to maintain its accuracy, which is a tedious task.
In contrast to the above two categories, categories 3 and 4 gather, from our point of view, the efficient approaches. First of all, they are not heavily constrained by a priori information such as pre-annotated datasets or weak assumptions. Even though the approaches of the fourth category learn some information from the raw data, this step is done once and remains valid for much longer. These approaches (except [23]) are validated on real continuous TV streams, which makes their results more realistic than the ones obtained on several chunks of streams extracted from the most structured parts of the days.

Conclusion
In this article, we introduced the reader to a new level in the video content hierarchy, the program level. Nowadays, the content of a video stream most probably does not belong to a single program. In order to apply any analysis step to a stream, we should first recover its structure, i.e. the programs that compose it. This is a mandatory step, since most of the available analysis tools work on videos with homogeneous content. This new level in the hierarchical video representation is the result of a segmentation step aiming to recover the original structure of the stream. We presented an up-to-date survey of the stream structuring approaches of the literature. For each approach, we listed its advantages and its drawbacks. Then, we presented the datasets used by these approaches, the evaluation measures and the results obtained. Finally, we opened a discussion about these approaches and highlighted some clues that may help readers to conclude which are the most efficient ones.

Figure 1. ToC: The six-level video content hierarchy